Site Reliability Engineer
About the Role:
We are seeking a highly skilled and experienced Site Reliability Engineer to lead our reliability and operations initiatives. In this role, you will provide technical leadership and mentorship to the operations team, setting standards for reliability, automation, and scalability.
This is an opportunity to combine leadership with engineering excellence: you will design and implement systems that scale while ensuring the operations team has the guidance and expertise it needs to succeed.
Key Responsibilities:
* Act as technical lead for the operations team, setting standards for reliability, automation, and scalability.
* Mentor and guide engineers, fostering knowledge sharing and technical growth.
* Lead incident response and root cause analysis, and ensure postmortem learnings are translated into improvements.
* Collaborate closely with development and product teams to balance agility with operational stability.
Hands-on Engineering:
* Infrastructure as Code (IaC): Build and manage infrastructure with Terraform; maintain and support legacy Ansible where needed.
* Kubernetes & Orchestration: Operate and optimize Kubernetes clusters, leveraging Argo CD for GitOps and Argo Workflows for workflow orchestration.
* CI/CD: Develop GitHub Actions pipelines and oversee the migration away from legacy Octopus Deploy.
* Systems Administration: Manage Linux and Windows Server systems, ensuring performance, reliability, and security.
* Monitoring & Observability: Own monitoring and observability solutions with Prometheus, Grafana, and OpenTelemetry; define and track SLOs/SLIs.
* Databases & Caching: Operate MSSQL, PostgreSQL, and Redis in production environments.
* Networking & Security: Manage WAF and CDN services (Cloudflare) and drive secure infrastructure practices.
Qualifications:
* Proven experience as a Site Reliability Engineer, DevOps Engineer, or Infrastructure Engineer with technical leadership responsibilities.
* Strong cloud platform experience with Azure.
* Strong expertise in Terraform; Ansible familiarity a plus.
* Hands-on with Kubernetes and GitOps workflows (Argo CD/Workflows).
* Skilled in both Linux and Windows Server environments.
* Experienced with CI/CD pipelines, particularly GitHub Actions.
* Deep understanding of monitoring/observability (Prometheus, Grafana, OpenTelemetry).
* Strong incident management and troubleshooting skills in distributed systems.
* Experience maintaining and scaling high-traffic web applications.
* Excellent collaboration and communication skills, with experience mentoring other engineers.
* Ability to work in a fully remote setup.
Nice to Have:
* Programming/scripting skills.
* Performance tuning and capacity planning expertise.
* Familiarity with compliance, governance, and security standards.