Job Overview:
We are seeking an experienced technical lead to join our operations team. As a site reliability engineer, you will provide guidance and mentorship while maintaining a hands-on approach.
This role offers a unique opportunity to combine leadership with engineering excellence. You will design and implement scalable systems, ensuring the operations team has the necessary expertise and guidance to succeed.
Key Responsibilities:
* Act as technical lead for the operations team, setting standards for reliability, automation, and scalability.
* Mentor and guide engineers, fostering knowledge sharing and technical growth.
* Lead incident response and root cause analysis, and ensure postmortem learnings are translated into improvements.
* Collaborate closely with development and product teams to balance agility with operational stability.
Technical Requirements:
* Infrastructure as Code (IaC): Build and manage infrastructure with Terraform; maintain and support legacy Ansible where needed.
* Kubernetes & Orchestration: Operate and optimize Kubernetes clusters, leveraging Argo CD and Argo Workflows for GitOps.
* CI/CD: Develop GitHub Actions pipelines and oversee the migration away from legacy Octopus Deploy.
* Systems Administration: Manage Linux and Windows Server systems, ensuring performance, reliability, and security.
* Monitoring & Observability: Own monitoring and observability solutions with Prometheus, Grafana, and OpenTelemetry; define and track SLOs/SLIs.
* Databases & Caching: Operate MSSQL, PostgreSQL, and Redis in production environments.
* Networking & Security: Manage WAF and CDN services (Cloudflare) and drive secure infrastructure practices.
Requirements:
* Proven experience as a Site Reliability Engineer, DevOps Engineer, or Infrastructure Engineer with technical leadership responsibilities.
* Strong cloud platform experience with Azure.
* Strong expertise in Terraform; Ansible familiarity a plus.
* Hands-on experience with Kubernetes and GitOps workflows (Argo CD, Argo Workflows).
* Skilled in both Linux and Windows Server environments.
* Experienced with CI/CD pipelines, particularly GitHub Actions.
* Deep understanding of monitoring/observability (Prometheus, Grafana, OpenTelemetry).
* Strong incident management and troubleshooting skills in distributed systems.
* Experience maintaining and scaling high-traffic web applications.
* Excellent collaboration and communication skills, with experience mentoring other engineers.
* Ability to work in a fully remote setup.
Benefits:
* Opportunity to grow professionally and technically.
* Flexible working environment.
* Competitive compensation package.