As a DevOps & ML Ops Engineer, you will play a crucial role in our organization by developing and maintaining scalable, stable services that deliver machine learning models to end users with guaranteed uptime.
The primary focus of this position will be on the infrastructure, deployment, and continuous integration/continuous delivery (CI/CD) processes for our ML services. This includes managing resource allocation and workload scheduling for multiple ML services, ensuring efficient utilization of CPU/GPU resources and creating reliable queues based on service priorities.
You will also maintain VM environments, manage OS updates, and keep the VM inventory up to date. Additionally, you will work alongside the Dev and QA teams to detect hot spots in our applications and put preventative measures in place before they become live issues.
Troubleshooting and providing solutions for system configurations are also key responsibilities of this role. You will be responsible for planning, executing, and testing disaster recovery, as well as monitoring and examining all application, performance, event, and system logs to assist in troubleshooting.
Filing all IT/Colocation tickets and escalating them to the right person when necessary is another important aspect of this job. Furthermore, you will design, develop, and maintain the infrastructure required for deploying and scaling machine learning services.
This includes implementing and managing the CI/CD pipelines to ensure seamless and efficient deployment of ML models. Collaboration with data scientists, ML researchers, and language experts to understand the requirements for deploying ML models and provide necessary infrastructure support is also essential.
Automating and streamlining the build, test, and deployment processes to enhance efficiency and reduce time-to-market are critical tasks in this role. Monitoring and optimizing the performance, availability, and scalability of production ML systems, as well as developing and maintaining robust monitoring, logging, and alerting systems to proactively identify and address issues, are also key responsibilities.
Required Skills and Qualifications
* We are looking for someone with strong knowledge of cloud platforms such as AWS, Azure, or GCP, and experience in deploying and managing ML services on these platforms.
* Knowledge of distributed computing frameworks such as Spark and big data technologies such as Hadoop and Kafka is also necessary.
* Proficiency in languages such as Python, Shell, Ruby, Golang, or C++ and experience with infrastructure-as-code tools such as Terraform and CloudFormation are required.