Our client is a global marketing services company with an online platform focused on helping SMEs create and manage their marketing more easily and affordably, with fast production and delivery in multiple countries. They're active in multiple international markets across Europe and North America.
They're looking for a Site Reliability Engineer (SRE) to lead their monitoring and observability efforts. You'll define and improve SLOs and SLIs, guide teams on best practices, and help maintain a stable, reliable platform through modern monitoring solutions.
Key Responsibilities
Lead Monitoring & Observability Strategy: Develop and lead the implementation of the company's monitoring and observability approach.
Define & Maintain SLOs/SLIs: Set, implement, and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical services.
Mentor Product Managers & Engineering Leads: Guide teams on the definition and optimisation of SLOs/SLIs.
Collaborate Across Teams: Work closely with engineering, product, quality, and monitoring teams to manage incidents and maintain system health.
Set Up Monitoring Tools: Configure and manage tools like Datadog, Cloudflare, and Azure Cloud to monitor platform performance.
Improve Incident Management: Continuously improve processes to identify and resolve performance bottlenecks.
Optimise CI/CD Processes: Enhance CI/CD pipelines for better performance, reliability, and incident prevention.
Integrate Observability in Testing: Collaborate with QA teams to incorporate observability into testing processes for early issue detection.
Ensure High Availability & Security: Implement best practices to maintain high availability, performance, and security across the infrastructure.
Evolve SRE Practices: Drive the evolution of SRE practices and foster a culture of observability within the team.
What You Bring
Site Reliability Engineering Experience: Mid-level to senior experience in an SRE role, with a solid background as a developer.
E-commerce Experience: Experience working on high-traffic, customer-facing platforms such as e-commerce.
Monitoring & Observability Expertise: Strong experience with monitoring tools, observability frameworks, and related technologies.
Experience with Datadog or Similar Tools: Hands-on experience with Datadog or similar monitoring tools.
Cloud Experience: Experience working in a cloud-focused environment (e.g., Azure or similar).
Scripting Proficiency: Proficient in scripting for automation and system management.
SLO/SLI Implementation: Proven experience defining and implementing SLOs and SLIs for large-scale systems.
Incident Management & Collaboration: Deep understanding of incident management and effective collaboration with engineering teams.
Passion for System Reliability: Monitoring-focused and passionate about enhancing system reliability and visibility.
Mentorship Experience: Previous experience in mentoring and guiding teams on observability best practices.
Why Apply Now?
Don't miss the opportunity to make a significant impact in a dynamic environment. This role allows you to mentor teams, implement best practices, and drive system improvements. Enjoy a flexible 4-day workweek and 100% remote work (Portugal-based).
Are you ready to take the next step in your career? Send your CV to ******