Site reliability engineer

Coimbra

Intermedia Intelligent Communications

Posted on 3 June

Description

Company Details
Department: Tech Operations
Location: Portugal
Description: ALL CANDIDATES MUST BE LOCATED IN PORTUGAL

We offer a hybrid working style, with an office in Coimbra and plans to open offices in Aveiro and Porto in the future. This approach gives team members the flexibility to work remotely while also coming together in the office for collaboration and teamwork.

About Intermedia
Are you looking for a company where YOUR VOICE is heard? Where you can MAKE A DIFFERENCE? Do you THRIVE in a FAST-PACED work environment? Do you wake every morning EXCITED to work with GREAT PEOPLE and create SUCCESS TOGETHER? Then Intermedia is the place for you.

Intermedia has established itself as a leading provider of cloud communications and collaboration tech that allows companies to connect better. We have a strong track record of growth, profitability, and creating an environment where everyone matters. Everyone. While we are fast-paced and admittedly a bit intense, we promise that you won't be bored. You will find Intermedia is a place where you can indulge your passion for creating and supporting great cloud technology. What's more, we always look to promote from within and have many employees who have been with us 10, 15, and 20+ years!

Culture at Intermedia is built on teamwork and transparency. We hold each other accountable and always have each other's back! Are you ready to make your mark?

About The Role
We are looking for an SRE to improve reliability and operational readiness with a strong focus on metrics, alerting, and event management. You will build and maintain monitoring using Prometheus/VictoriaMetrics, integrate alerts and events with BigPanda, and participate in on-call rotations to drive fast incident response and continuous improvement across Windows and Linux environments.

Key Responsibilities
- Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
- Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
- Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
- Create and maintain dashboards and operational visibility (Grafana or equivalent)
- Develop and maintain runbooks, operational playbooks, and incident response procedures
- Participate in on-call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
- Perform root-cause analysis, postmortems, and implement corrective/preventive actions
- Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
- Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
- Collaborate with DevOps/Engineering to instrument applications and standardise telemetry (metrics, logs, traces where applicable)

Skills, Knowledge And Expertise
- Experience in SRE / Operations / DevOps with production incident ownership
- Hands-on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
- Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
- Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
- Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
- Experience with Git-based workflows for monitoring-as-code and configuration management

Nice to have
- Grafana administration and dashboard design standards
- Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
- Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
- Messaging/cache/proxy operations: RabbitMQ, Redis, Nginx
- Experience with Windows clustering or HA environments
- Experience defining SLOs/SLIs and operational KPIs
- Experience in managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
- Experience with load balancing components (F5 LTM, F5 GTM)
- Experience with virtualization platforms such as VMware or Hyper-V
- Experience with administering AWS or Azure tenants

On-call expectations
- Participation in a rotating on-call schedule (including nights/weekends as needed)
- Ownership of incident response: rapid triage, escalation, mitigation, and follow-up improvements
- Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR

Diversity, Inclusion, and Equal Opportunity
We hire, promote, and compensate employees based on their ability to perform their job responsibilities, without regard to race, color, creed, religion, sex, gender, marital status, national origin, ancestry, age, citizenship, physical or mental disability, sexual orientation, or any other basis protected by applicable law (collectively referred to in our Code of Conduct as "Protected Classes"). We do not tolerate employment discrimination in the workplace, and we are committed to making reasonable accommodations for identified disabilities or other limitations as required by all applicable laws. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
