Team Lead: Core & MLOps Squad
Location: Lisbon, Portugal (remote-friendly)
At Zyte, we eat data for breakfast, and you can eat your breakfast anywhere while working with us. Founded in 2010, we are a globally distributed team of more than 250 Zytans working from over 28 countries. Our mission is to enable our customers to extract the data they need to keep innovating and growing their businesses. We believe that all businesses deserve a smooth pathway to data.
What You'll Do
Technical Leadership
- Design and evolve the core platform (Kubernetes, Mesos, GPU scheduling/autoscaling, distributed compute)
- Own the model platform: registry, experiment tracking, training orchestration, evaluation, serving, and monitoring
- Build the Golden Path: reference repos, a scaffold CLI, opinionated CI/CD pipelines, runtime contracts (health/metrics/tracing/SLOs), high-performance clients, circuit breakers and other production‐ready defaults
MLOps Excellence
- Operate a secure, multi‐tenant model registry and training platform with standardized experiment/evaluation harnesses
- Provide turnkey serving patterns (online + batch), drift/quality monitoring, and rollback playbooks
- Integrate public/open‐source AI capabilities as managed platform services with cost and data‐governance guardrails
Team Management
- Run the squad: roadmap/prioritization, delivery, mentoring, and high engineering standards
- Partner with product engineering (Zyte API, Scrapy Cloud), Prod Ops, and Security on adoption and rollout plans
- Foster a platform-thinking mindset across the squad
Ownership Areas
- Container orchestration (Kubernetes/Knative), GPU provisioning & autoscaling, environment & secret management
- Operators, sidecars, and internal SDKs/libraries (Go/Rust/Python/Java) that enforce the golden path contract
- Model platform: registry, experiment tracking, training orchestration, evaluation framework, serving infra, model monitoring
- Observability: logging/metrics/tracing pipelines
- Billing pipeline: metering/events/cost tracking abstractions
- Golden Path: Java, Python, ML templates + CI/CD blueprints + docs + scaffold CLI
- Reliability enablement (SRE practices), cost governance, supply‐chain security (SBOM, image signing)
Required Qualifications
- 5+ years of experience building distributed systems; 3+ years in MLOps/ML platform engineering (or equivalent impact)
- Knowledge of Linux/OS internals (process model, cgroups/namespaces), networking (TCP/IP, HTTP/2), concurrency, and performance profiling
- Deep understanding of Kubernetes (bonus: Mesos)
- Proficiency in developing high-performance services in Java, Rust, Go, or C++ (bonus: familiarity with the Vert.x and Netty frameworks); strong Python skills
- Experience with GPU infrastructure (scheduling, containerization, optimization)
- Track record of designing and operating model platforms (registry, training, serving, monitoring) in production
- Demonstrated success leading technical teams and implementing organization-wide platform solutions
Preferred Qualifications
- Streaming & workflows: Kafka plus Argo/Temporal/Airflow or equivalents
- eBPF‐based observability, perf tooling, or io_uring experience
- Cost optimization for ML/AI; multi‐tenant quotas and fairness
- Hands‐on experience authoring Golden Paths (service chassis/templates, CI/CD blueprints, CLI scaffolds)
- SRE practices (SLIs/SLOs, incident management)
Benefits
- Join a company that loves fostering and nourishing new ideas and bringing them to market
- Become part of a self‐motivated, progressive, multi‐cultural team
- Have the freedom and flexibility to work from where you do your best work, as we are a completely remote company
- Get the chance to work with cutting‐edge open‐source technologies and tools