About the Role
We are looking for a Site Reliability Engineer (SRE) – Technical Lead to spearhead our reliability and operations initiatives. In this role, you'll remain deeply hands-on while providing technical leadership and mentorship to the operations team. You will set the technical direction, guide architecture decisions, and champion best practices in automation, observability, and resilience.
This is an opportunity to combine leadership with engineering excellence: you'll design and implement systems that scale while ensuring the operations team has the guidance and expertise it needs to succeed.
Technical Leadership:
* Act as the technical lead for the operations team, setting standards for reliability, automation, and scalability.
* Mentor and guide engineers, fostering knowledge sharing and technical growth.
* Lead incident response and root cause analysis, and ensure postmortem learnings are translated into improvements.
* Collaborate closely with development and product teams to balance agility with operational stability.
Hands-on Engineering:
* Infrastructure as Code (IaC): Build and manage infrastructure with Terraform; maintain and support legacy Ansible where needed.
* Kubernetes & Orchestration: Operate and optimize Kubernetes clusters, using Argo CD for GitOps delivery and Argo Workflows for workflow automation.
* CI/CD: Develop GitHub Actions pipelines and oversee the migration away from legacy Octopus Deploy.
* Systems Administration: Manage Linux and Windows Server systems, ensuring performance, reliability, and security.
* Monitoring & Observability: Own monitoring and observability solutions with Prometheus, Grafana, and OpenTelemetry; define and track SLOs/SLIs.
* Databases & Caching: Operate MSSQL, PostgreSQL, and Redis in production environments.
* Networking & Security: Manage WAF and CDN services (Cloudflare) and drive secure infrastructure practices.
Qualifications
* Proven experience as a Site Reliability Engineer, DevOps Engineer, or Infrastructure Engineer with technical leadership responsibilities.
* Strong cloud platform experience with Azure.
* Strong expertise in Terraform; familiarity with Ansible is a plus.
* Hands-on with Kubernetes and GitOps workflows (Argo CD/Workflows).
* Skilled in both Linux and Windows Server environments.
* Experienced with CI/CD pipelines, particularly GitHub Actions.
* Deep understanding of monitoring/observability (Prometheus, Grafana, OpenTelemetry).
* Strong incident management and troubleshooting skills in distributed systems.
* Experience maintaining and scaling high-traffic web applications.
* Excellent collaboration and communication skills, with experience mentoring other engineers.
* Ability to work in a fully remote setup.
Nice to Have
* Programming/scripting skills.
* Performance tuning and capacity planning expertise.
* Familiarity with compliance, governance, and security standards.
Why Join Us?
* Take a technical leadership role while staying hands-on with cutting-edge SRE practices.
* Shape the reliability roadmap and mentor a skilled operations team.
* Work across a diverse, modern stack while leading the transition from legacy systems.