Site Reliability Engineer - Production Support
Maximum Rate: $50/hr.
Position Overview
We are seeking a skilled and experienced Production Support Engineer, engaged through vendor staffing, to support our digital applications. This role combines hands-on production support with Site Reliability Engineering (SRE) principles, focusing on eliminating toil, automating infrastructure, and ensuring high availability of critical digital applications and backend systems.
Primary Responsibilities
1. Toil Removal & Infrastructure Maintenance (15%)
· Execute SSL/TLS certificate updates and renewals across production environments
· Perform Windows and Linux server patching and security updates
· Manage NPID password updates and credential rotation protocols
· Implement security vulnerability remediation in production systems
· Identify, document, and eliminate repetitive manual operational tasks
2. Infrastructure & Database Cluster Management (20%)
· Manage and support Elasticsearch cluster operations (deployment, scaling, monitoring, troubleshooting, performance tuning)
· Administer MongoDB clusters including replication, sharding, backup, recovery, and maintenance
· Operate and maintain Redis instances for caching and session management
· Monitor cluster health and perform capacity planning and optimization
· Execute failover and disaster recovery procedures
· Ensure data integrity and backup compliance
3. Automation & SRE Activities (15%)
· Develop, maintain, and enhance Ansible playbooks for infrastructure automation
· Build infrastructure-as-code solutions to reduce manual intervention
· Create and maintain comprehensive runbooks and operational playbooks
· Design monitoring, alerting, and observability solutions
· Implement automated remediation for common operational issues
· Quantify and prioritize toil reduction opportunities
4. Production Application Support (50%)
· Troubleshoot and resolve production incidents affecting digital applications
· Collaborate with application development and support teams on issue diagnosis
· Participate in incident response, root cause analysis, and post-mortems
· Monitor and respond to application performance degradation
---
Technical Requirements
Required Expertise (Must-Have)
· Ansible: 2+ years of hands-on experience writing playbooks, roles, and automation workflows
· Elasticsearch: 2+ years managing and troubleshooting Elasticsearch clusters in production
· MongoDB: 2+ years with replica sets, sharding, backup/recovery, and performance tuning
· Redis: Proficiency in deployment, configuration, and operational support
· OpenShift: Experience deploying and managing containerized applications on OpenShift
· Azure: Knowledge of Azure cloud services, resource management, and deployments
· Linux Administration: 3+ years with RHEL, CentOS, or Ubuntu in production environments
· Windows Server Administration: Experience with patching, certificate management, and maintenance
· Shell Scripting: Bash scripting for automation and operational tasks
· Incident Management: Experience responding to and resolving critical production incidents
Preferred Skills
· Kubernetes or container orchestration platforms
· Python or Go scripting for automation
· CI/CD pipeline experience (Jenkins, GitLab CI, Azure DevOps)
· Monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog)
· Infrastructure-as-Code tools (Terraform, CloudFormation)
· Security best practices and vulnerability management
· Relevant certifications (AZ-900, CKA, Elasticsearch, etc.)
---
Required Qualifications
· Minimum 5 years of production infrastructure support or SRE experience
· Minimum 3 years with at least 2 of the core technologies (Elasticsearch, MongoDB, Ansible, OpenShift)
· Experience working in a regulated financial services environment (preferred)
· Ability to work independently and in teams
· Strong troubleshooting and analytical capabilities
· Excellent documentation and communication skills
· Must be available for on-call support rotation (with reasonable notice)
---
Operational Expectations
· On-Call Rotation: Participates in the production support on-call schedule
· Incident Response: Available for critical incident resolution outside standard business hours as required
· Availability: Core business hours, plus flexibility for critical production issues
· Response Time: First response to critical incidents within 30 minutes
· Documentation: Maintains detailed runbooks, playbooks, and knowledge base articles
· Collaboration: Regular communication with infrastructure, development, and operations teams