Site Reliability Engineer - Production Support

Toronto, ON, Canada

Contracted

Experienced

Role: Site Reliability Engineer - Production Support
Maximum rate: $50/hr.
Position Overview

We are seeking a skilled and experienced Production Support Engineer, engaged through vendor staffing, to support our digital applications. This role combines hands-on production support with Site Reliability Engineering (SRE) principles, focusing on toil elimination, infrastructure automation, and high availability of critical digital applications and backend systems.

Primary Responsibilities

1. Toil Removal & Infrastructure Maintenance (15%)

· Execute SSL/TLS certificate updates and renewals across production environments

· Perform Windows and Linux server patching and security updates

· Manage NPID password updates and credential rotation protocols

· Implement security vulnerability remediation in production systems

· Identify, document, and eliminate repetitive manual operational tasks

2. Infrastructure & Database Cluster Management (20%)

· Manage and support Elasticsearch cluster operations (deployment, scaling, monitoring, troubleshooting, performance tuning)

· Administer MongoDB clusters including replication, sharding, backup, recovery, and maintenance

· Operate and maintain Redis instances for caching and session management

· Monitor cluster health and perform capacity planning and optimization

· Execute failover and disaster recovery procedures

· Ensure data integrity and backup compliance

3. Automation & SRE Activities (15%)

· Develop, maintain, and enhance Ansible playbooks for infrastructure automation

· Build infrastructure-as-code solutions to reduce manual intervention

· Create and maintain comprehensive runbooks and operational playbooks

· Design monitoring, alerting, and observability solutions

· Implement automated remediation for common operational issues

· Quantify and prioritize toil reduction opportunities

4. Production Application Support (50%)

· Troubleshoot and resolve production incidents affecting digital applications

· Collaborate with application development and support teams on issue diagnosis

· Participate in incident response, root cause analysis, and post-mortems

· Monitor and respond to application performance degradation

---

Technical Requirements

Required Expertise (Must-Have)

· Ansible: 2+ years hands-on experience writing playbooks, roles, and automation workflows

· Elasticsearch: 2+ years managing and troubleshooting Elasticsearch clusters in production

· MongoDB: 2+ years with replica sets, sharding, backup/recovery, and performance tuning

· Redis: Proficiency in deployment, configuration, and operational support

· OpenShift: Experience deploying and managing containerized applications on OpenShift

· Azure: Knowledge of Azure cloud services, resource management, and deployments

· Linux Administration: 3+ years with RHEL, CentOS, or Ubuntu in production environments

· Windows Server Administration: Experience with patching, certificate management, and maintenance

· Shell Scripting: Bash scripting for automation and operational tasks

· Incident Management: Experience responding to and resolving critical production incidents

Preferred Skills

· Kubernetes or container orchestration platforms

· Python or Go scripting for automation

· CI/CD pipeline experience (Jenkins, GitLab CI, Azure DevOps)

· Monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog)

· Infrastructure-as-Code tools (Terraform, CloudFormation)

· Security best practices and vulnerability management

· Relevant certifications (AZ-900, CKA, Elasticsearch, etc.)

---

Required Qualifications

· Minimum 5 years of production infrastructure support or SRE experience

· Minimum 3 years with at least 2 of the core technologies (Elasticsearch, MongoDB, Ansible, OpenShift)

· Experience working in a regulated financial services environment (preferred)

· Ability to work independently and in teams

· Strong troubleshooting and analytical capabilities

· Excellent documentation and communication skills

· Must be available for on-call support rotation (with reasonable notice)

---

Operational Expectations

· On-Call Rotation: Participates in production support on-call schedule

· Incident Response: Available for critical incident resolution outside standard business hours as required

· Availability: Core business hours, plus flexibility for critical production issues

· Response Time: First response to critical incidents within 30 minutes

· Documentation: Maintains detailed runbooks, playbooks, and knowledge base articles

· Collaboration: Regular communication with infrastructure, development, and operations teams

Aarorn Technologies Inc.
