About the Role
We are seeking a seasoned Reliability Engineering Lead to spearhead our reliability and operations initiatives.
This is an opportunity to combine leadership with engineering expertise: you'll design and implement systems that scale while making sure your team has the guidance and support it needs to succeed.
Technical Leadership:
* Act as the technical lead for the operations team, setting standards for reliability, automation, and scalability.
* Mentor and guide engineers, fostering knowledge sharing and technical growth.
* Lead incident response and root cause analysis, and ensure postmortem learnings are translated into improvements.
* Collaborate closely with development and product teams to balance agility with operational stability.
Hands-On Engineering:
* Infrastructure as Code (IaC): Build and manage infrastructure with Terraform; maintain and support legacy Ansible where needed.
* Kubernetes & Orchestration: Operate and optimize Kubernetes clusters, using Argo CD for GitOps delivery and Argo Workflows for workflow automation.
* CI/CD: Develop GitHub Actions pipelines and oversee the migration away from legacy Octopus Deploy.
* Systems Administration: Manage Linux and Windows Server systems, ensuring performance, reliability, and security.
* Monitoring & Observability: Own monitoring and observability solutions with Prometheus, Grafana, and OpenTelemetry; define and track SLOs/SLIs.
* Databases & Caching: Operate MSSQL, PostgreSQL, and Redis in production environments.
* Networking & Security: Manage WAF and CDN services (Cloudflare) and drive secure infrastructure practices.
Requirements
* Proven experience as a Site Reliability Engineer, DevOps Engineer, or Infrastructure Engineer with technical leadership responsibilities.
* Strong cloud platform experience with Azure.
* Strong expertise in Terraform; Ansible familiarity a plus.
* Hands-on experience with Kubernetes and GitOps workflows (Argo CD, Argo Workflows).
* Skilled in both Linux and Windows Server environments.
* Experienced with CI/CD pipelines, particularly GitHub Actions.
* Deep understanding of monitoring/observability (Prometheus, Grafana, OpenTelemetry).
* Strong incident management and troubleshooting skills in distributed systems.
* Experience maintaining and scaling high-traffic web applications.
* Excellent collaboration and communication skills, with experience mentoring other engineers.
* Ability to work in a fully remote setup.
Benefits
* Take a technical leadership role while staying hands-on with cutting-edge SRE practices.
* Shape the reliability roadmap and mentor a skilled operations team.
* Work across a diverse, modern stack while leading the transition from legacy systems.