Platform engineering team lead - remote

Funchal

Zyte

Posted on 6 May

Description

Team Lead: Core & MLOps Squad

Location: Lisbon, Portugal (remote-friendly)

At Zyte, we eat data for breakfast, and you can eat your breakfast anywhere while working for us. Founded in 2010, we are a globally distributed team of over 250 Zytans working from more than 28 countries, on a mission to enable our customers to extract the data they need to continue to innovate and grow their businesses. We believe that all businesses deserve a smooth pathway to data.

What You'll Do

Technical Leadership

- Design and evolve the core platform (Kubernetes, Mesos, GPU scheduling/autoscaling, distributed compute)

- Own the model platform: registry, experiment tracking, training orchestration, evaluation, serving, and monitoring

- Build the Golden Path: reference repos, a scaffold CLI, opinionated CI/CD pipelines, runtime contracts (health/metrics/tracing/SLOs), high-performance clients, circuit breakers and other production‐ready defaults

MLOps Excellence

- Operate a secure, multi‐tenant model registry and training platform with standardized experiment/evaluation harnesses

- Provide turnkey serving patterns (online + batch), drift/quality monitoring, and rollback playbooks

- Integrate public/open‐source AI capabilities as managed platform services with cost and data‐governance guardrails

Team Management

- Run the squad: roadmap/prioritization, delivery, mentoring, and high engineering standards

- Partner with product engineering (Zyte API, Scrapy Cloud), Prod Ops, and Security on adoption and rollout plans

- Mentor the team and foster a platform-thinking mindset

Ownership Areas

- Container orchestration (Kubernetes/Knative), GPU provisioning & autoscaling, environment & secret management

- Operators, sidecars, and internal SDKs/libraries (Go/Rust/Python/Java) that enforce the golden path contract

- Model platform: registry, experiment tracking, training orchestration, evaluation framework, serving infra, model monitoring

- Observability: logging/metrics/tracing pipelines

- Billing pipeline: metering/events/cost tracking abstractions

- Golden Path: Java, Python, ML templates + CI/CD blueprints + docs + scaffold CLI

- Reliability enablement (SRE practices), cost governance, supply‐chain security (SBOM, image signing)

Required Qualifications

- 5+ years of experience building distributed systems; 3+ years in MLOps/ML platform engineering (or equivalent impact)

- Knowledge of Linux/OS internals (process model, cgroups/namespaces), networking (TCP/IP, HTTP/2), concurrency, and performance profiling

- Deep understanding of Kubernetes (bonus: Mesos)

- Proficiency in developing high-performance services in Java, Rust, Go, or C++ (bonus: familiarity with the Vert.x and Netty frameworks); strong Python skills

- Experience with GPU infrastructure (scheduling, containerization, optimization)

- Track record of designing and operating model platforms (registry, training, serving, monitoring) in production

- Demonstrated success leading technical teams and implementing organization-wide platform solutions

Preferred Qualifications

- Streaming & workflows: Kafka plus Argo/Temporal/Airflow or equivalents

- eBPF‐based observability, perf tooling, or io_uring experience

- Cost optimization for ML/AI; multi‐tenant quotas and fairness

- Hands‐on experience authoring Golden Paths (service chassis/templates, CI/CD blueprints, CLI scaffolds)

- SRE practices (SLIs/SLOs, incident management)

Benefits

- We love fostering and nourishing new ideas and bringing them to market

- Become part of a self‐motivated, progressive, multi‐cultural team

- Have the freedom and flexibility to work from where you do your best work, as we are a completely remote company

- Get the chance to work with cutting‐edge open‐source technologies and tools
