Site Reliability Engineer
This is an exciting opportunity to combine leadership with engineering excellence, designing and implementing systems that scale while ensuring your team has the guidance and expertise they need to succeed.
The ideal candidate will spearhead reliability and operations initiatives as a Technical Lead for our Site Reliability Engineering team. They will establish standards for reliability, automation, and scalability for the operations team, and will mentor and guide engineers, fostering knowledge sharing and technical growth.
Responsibilities:
* Establish technical direction and guide architecture decisions to ensure operational stability and scalability.
* Mentor and guide engineers in developing skills and best practices in automation, observability, and resilience.
* Lead incident response and root cause analysis, and ensure postmortem learnings are translated into improvements.
* Collaborate closely with development and product teams to balance agility with operational stability.
Technical Skills:
* Proficiency in Terraform for building and managing infrastructure.
* Experience with Kubernetes and GitOps workflows, including Argo CD and Argo Workflows.
* Proficiency with Linux and Windows Server environments, with experience maintaining and scaling high-traffic web applications.
* Expertise in monitoring and observability solutions, including Prometheus, Grafana, and OpenTelemetry.
* Hands-on experience with CI/CD pipelines, particularly GitHub Actions.
Nice to Have:
* Programming or scripting skills.
* Performance tuning and capacity planning expertise.
* Familiarity with compliance, governance, and security standards.