About the Role
We are looking for a Site Reliability Engineer (SRE) – Technical Lead to spearhead our reliability and operations initiatives.
In this role, you'll remain deeply hands-on while providing technical leadership and mentorship to the operations team.
You will set the technical direction, guide architecture decisions, and champion best practices in automation, observability, and resilience.
This is an opportunity to combine leadership with engineering excellence: you'll design and implement systems that scale while ensuring the operations team has the guidance and expertise it needs to succeed.
Technical Leadership:
Act as the technical lead for the operations team, setting standards for reliability, automation, and scalability.
Mentor and guide engineers, fostering knowledge sharing and technical growth.
Lead incident response and root cause analysis, and ensure postmortem learnings are translated into improvements.
Collaborate closely with development and product teams to balance agility with operational stability.
Hands-on Engineering:
Infrastructure as Code (IaC): Build and manage infrastructure with Terraform; maintain and support legacy Ansible where needed.
Kubernetes & Orchestration: Operate and optimize Kubernetes clusters, leveraging Argo CD and Argo Workflows for GitOps.
CI/CD: Develop GitHub Actions pipelines and oversee the migration away from legacy Octopus Deploy.
Systems Administration: Manage Linux and Windows Server systems, ensuring performance, reliability, and security.
Monitoring & Observability: Own monitoring and observability solutions with Prometheus, Grafana, and OpenTelemetry; define and track SLOs/SLIs.
Databases & Caching: Operate MSSQL, PostgreSQL, and Redis in production environments.
Networking & Security: Manage WAF and CDN services (Cloudflare) and drive secure infrastructure practices.
Qualifications
Proven experience as a Site Reliability Engineer, DevOps Engineer, or Infrastructure Engineer with technical leadership responsibilities.
Strong cloud platform experience with Azure.
Strong expertise in Terraform; Ansible familiarity a plus.
Hands-on experience with Kubernetes and GitOps workflows (Argo CD/Workflows).
Skilled in both Linux and Windows Server environments.
Experienced with CI/CD pipelines, particularly GitHub Actions.
Deep understanding of monitoring/observability (Prometheus, Grafana, OpenTelemetry).
Strong incident management and troubleshooting skills in distributed systems.
Experience maintaining and scaling high-traffic web applications.
Excellent collaboration and communication skills, with experience mentoring other engineers.
Ability to work in a fully remote setup.
Nice to Have
Programming/scripting skills.
Performance tuning and capacity planning expertise.
Familiarity with compliance, governance, and security standards.
Why Join Us?
Take a technical leadership role while staying hands-on with cutting-edge SRE practices.
Shape the reliability roadmap and mentor a skilled operations team.
Work across a diverse, modern stack while leading the transition from legacy systems.