ML Infrastructure Specialist

Oeiras

beBeeDevops

Posted on 30 November

Description

As a DevOps & ML Ops Engineer, you will play a crucial role in our organization by developing and maintaining scalable, stable services that deliver machine learning models to end users with guaranteed uptime.

The primary focus of this position will be on the infrastructure, deployment, and continuous integration/continuous delivery (CI/CD) processes for our ML services. This includes managing resource allocation and workload scheduling for multiple ML services, ensuring efficient utilization of CPU/GPU resources and creating reliable queues based on service priorities.

You will also maintain VM environments, manage OS updates, and keep the VM inventory up to date. Additionally, you will work alongside the Dev and QA teams to detect hot spots in our applications and put preventative measures in place before they become live issues.

Troubleshooting and providing solutions for system configurations are also key responsibilities of this role. You will be responsible for planning, executing, and testing disaster recovery, as well as monitoring and examining all application, performance, event, and system logs to assist in troubleshooting.

Filing all IT/Colocation tickets and escalating to the right person if necessary is another important aspect of this job. Furthermore, you will design, develop, and maintain the infrastructure required for deploying and scaling machine learning services.

This includes implementing and managing the CI/CD pipelines to ensure seamless and efficient deployment of ML models. Collaboration with data scientists, ML researchers, and language experts to understand the requirements for deploying ML models and provide necessary infrastructure support is also essential.

Automating and streamlining the build, test, and deployment processes to enhance efficiency and reduce time-to-market are critical tasks in this role. Monitoring and optimizing the performance, availability, and scalability of production ML systems, as well as developing and maintaining robust monitoring, logging, and alerting systems to proactively identify and address issues, are also key responsibilities.

Required Skills and Qualifications

* We are looking for someone with strong knowledge of cloud platforms such as AWS, Azure, or GCP, and experience in deploying and managing ML services on these platforms.

* Knowledge of distributed computing frameworks such as Spark and big data technologies such as Hadoop and Kafka is also necessary.

* Proficiency in languages such as Python, Shell, Ruby, Golang, or C++ and experience with infrastructure-as-code tools such as Terraform and CloudFormation are required.

* Familiarity with CI/CD tools such as Jenkins and GitLab CI/CD, as well as version control systems such as Git, is required.

* Solid understanding of networking, security, and system administration concepts is necessary.

* Strong problem-solving and troubleshooting skills, with the ability to quickly analyze and resolve issues in complex ML systems, are also required.

* Excellent communication and collaboration skills, with the ability to work effectively in a team-oriented environment, are necessary.

* Bachelor's or higher degree in Computer Science, Engineering, or a related field is required.

* Proven experience as an ML Ops Engineer or DevOps Engineer, or in a similar role, with a focus on deploying and maintaining machine learning models in production environments, is necessary.

Desired Skills and Experience

* Experience with machine learning frameworks and libraries such as TensorFlow, PyTorch, or scikit-learn is desired.

* Familiarity with serverless computing and event-driven architectures is also desirable.

* Experience with logging and monitoring tools such as ELK Stack, Prometheus, and Grafana is desired.

* Understanding of software development methodologies and agile practices is also desirable.
