Senior DevOps / Site Reliability Engineer (SRE)

Funchal

emagine - Portugal

Posted on 13 June

Description

Mandatory Languages: French & English

We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to join an international team responsible for ensuring the reliability, scalability, security, and performance of cloud-native platforms and critical production environments.

This is an excellent opportunity for someone passionate about automation, observability, cloud infrastructure, and Site Reliability Engineering practices, who enjoys working in highly available, large-scale environments.

What you'll be doing:
- Design, implement, and maintain highly available cloud infrastructure, primarily on AWS
- Improve system reliability through SRE best practices, including SLOs, SLIs, and error budgets
- Build, maintain, and optimize CI/CD pipelines to support fast and secure software delivery
- Develop and manage Infrastructure as Code using Terraform and other automation tools
- Administer and optimize Kubernetes clusters and containerized environments
- Implement monitoring, logging, alerting, and observability solutions
- Lead incident response activities, perform root cause analysis, and contribute to postmortems
- Improve platform scalability, resilience, disaster recovery, and operational excellence
- Collaborate closely with development teams to enhance deployment processes and platform reliability
- Implement cloud security best practices, including IAM, secrets management, vulnerability remediation, and patching

What we're looking for:
- 5+ years of experience in DevOps, SRE, Platform Engineering, or Cloud Infrastructure roles
- Strong Linux administration and troubleshooting skills
- Hands-on experience managing Kubernetes in production environments
- Experience with AWS cloud services
- Strong knowledge of Infrastructure as Code (Terraform preferred)
- Experience with CI/CD tools such as GitLab CI, Jenkins, GitHub Actions, or Azure DevOps
- Experience with Docker and Helm
- Strong understanding of observability and monitoring solutions such as Prometheus, Grafana, ELK, Datadog, or Splunk
- Scripting experience with Bash, Python, or similar technologies
- Experience participating in on-call rotations and supporting production environments

Nice to have:
- Experience with Azure and/or Google Cloud Platform
- Strong networking knowledge (TCP/IP, DNS, Load Balancers, Reverse Proxies)
- Experience with incident management and operational excellence frameworks
