HCI Sr. Compute Engineer (Red Hat OpenShift)

Stefanini EMEA

Posted on March 22

Description

Stefanini Group is seeking a Senior Compute Engineer specialized in Red Hat OpenShift to strengthen our Compute Operations team and provide Level 3 (L3) expert support for enterprise customers running critical workloads on container and virtualization platforms. This role is a key technical position focused on day-to-day operations, stability, and continuous improvement of OpenShift-based platforms. The engineer will act as the highest escalation point for complex incidents and problems, support platform lifecycle activities (upgrades, patching, performance tuning), and contribute to platform modernization initiatives, including VMware-to-OpenShift virtualization transformation programs. The ideal candidate combines strong troubleshooting skills, deep infrastructure understanding, and hands-on OpenShift expertise, with the ability to work in a structured operational environment (ITIL/managed services) while also supporting automation and standardization.

Job Responsibilities

Level 3 Operations & Technical Escalation (Core Responsibility)

- Act as the L3 escalation point for complex technical issues related to:
  - Red Hat OpenShift clusters (control plane, worker nodes, networking, storage, authentication)
  - OpenShift Virtualization (KubeVirt) and VM-based workloads hosted on OpenShift
  - Linux OS-level issues impacting cluster stability or workloads
- Own and drive resolution of:
  - Major Incidents (P1/P2), with deep technical investigation and a rapid recovery focus
  - Recurring incidents through Problem Management (root-cause analysis and permanent fixes)
- Lead deep troubleshooting activities:
  - Cluster degradation, node failures, API instability, etcd performance issues
  - Networking issues (ingress, routes, DNS, CNI, service connectivity)
  - Storage issues (persistent volumes, performance bottlenecks, CSI failures)
  - Workload failures (pods, operators, deployments, stateful applications)
- Provide clear technical updates during incidents, including impact assessment, recovery plan/workaround, risks, and next steps

Platform Lifecycle Management (Upgrades, Patching, Stability)

- Plan and execute OpenShift lifecycle activities such as version upgrades (cluster upgrades and operator upgrades), patching, security hardening, certificate management, and renewal processes
- Validate platform readiness before changes: capacity, compatibility, performance, known issues
- Maintain high availability and resilience: backup/restore strategy support (including etcd backup practices), disaster recovery readiness, and operational runbooks
- Ensure operational compliance with defined maintenance windows and change governance

VMware-to-OpenShift Virtualization Transformation Support

- Support enterprise modernization initiatives involving migration from traditional virtualization platforms (VMware) to OpenShift Virtualization
- Contribute to:
  - Migration approach definition and technical design support
  - Workload onboarding, validation, and stabilization on OpenShift
  - Performance tuning and operational model definition for VM-based workloads on OpenShift
- Ensure production-grade operational readiness: monitoring, alerting, backup, patching, and a support model aligned with managed services standards

Standardization, Automation & Operational Improvement

- Develop and maintain operational documentation, including:
  - Troubleshooting guides
  - Standard operating procedures (SOPs)
  - Build standards and reference architectures
  - Operational runbooks for recurring tasks
- Support automation initiatives using tools such as:
  - Ansible/Automation Platform (preferred)
  - GitOps practices (ArgoCD) where applicable
  - Scripting (Bash/Python) to reduce manual operations
- Proactively identify improvements to increase platform stability, recovery speed (MTTR), and repeatability, and to reduce human error

Monitoring, Observability & Performance Management

- Support and improve observability across the platform:
  - OpenShift monitoring stack (Prometheus/Alertmanager/Grafana)
  - Log management (EFK/Loki or enterprise logging platforms)
- Troubleshoot performance issues related to:
  - Compute resource constraints
  - Scheduling and resource requests/limits
  - Cluster scaling and capacity planning
- Work with customer stakeholders and internal teams to:
  - Define alert thresholds
  - Reduce noise and false positives
  - Improve operational dashboards and health reporting

Security & Compliance Support

- Ensure the platform is operated in a secure manner aligned with enterprise expectations:
  - RBAC best practices
  - Integration with enterprise identity providers (LDAP/AD/SSO)
  - Secure cluster configuration and segregation
- Support vulnerability remediation and platform hardening initiatives
- Collaborate with Security teams on audits, compliance requests, and evidence collection

Job Requirements

Mandatory Technical Skills

- Strong hands-on experience with Red Hat OpenShift administration and operations
- Strong Linux background (RHEL preferred), including troubleshooting OS performance, services, networking, and storage
- Solid understanding of Kubernetes fundamentals: pods, deployments, services, ingress, namespaces, RBAC, operators
- Experience troubleshooting infrastructure-related issues across compute, network, storage, and platform services
- Experience working in production environments with uptime and SLA commitments

Mandatory Professional Skills

- Proven ability to operate as Level 3 support, including deep troubleshooting, structured root-cause analysis, and ownership until resolution
- Ability to communicate clearly with customers (technical and non-technical stakeholders) and internal teams (L1/L2/architects/project teams)
- Strong documentation discipline and operational mindset

Preferred/Nice-to-Have Skills

- Experience with OpenShift Virtualization (KubeVirt) and VM-based workloads
- Experience supporting VMware environments and understanding of virtualization concepts: vSphere architecture, clusters, HA/DRS, storage/datastores, VM lifecycle
- Experience with automation tools: Ansible/Red Hat Ansible Automation Platform, GitOps tools (ArgoCD), Infrastructure as Code practices
- Experience with enterprise storage and CSI integrations
- Experience with enterprise networking topics (DNS, routing, firewall constraints, load balancing)
- Experience with public cloud OpenShift deployments (optional): ROSA/ARO/OCP on AWS/Azure/GCP

Certifications (Preferred)

- Red Hat Certified Specialist in OpenShift Administration (preferred)
- Red Hat Certified Engineer (RHCE) (strong advantage)
- Kubernetes certifications (CKA/CKAD) (nice to have)

Working Model & Operational Expectations

- Work in an operational environment following ITIL practices (Incident/Problem/Change Management), a managed services delivery model, and SLA commitments
- Participate in on-call rotation, planned maintenance windows, and technical escalation duty as required
- Provide clear handovers and updates to ensure continuity across shifts/regions

What's Next

It's best to apply today, because job postings can be taken down and we wouldn't want you to miss this opportunity. In case you need further information, just send us a message at recruitmentEMEA@stefanini.com and we'll be happy to assist!

Diversity & Inclusion

Here at the Stefanini Group, we value plurality and equity, regardless of race, sexual orientation, disability, age, ancestry, religion, gender, and nationality. We understand and encourage the importance of being you!
