Machine Learning Operations Engineer
We are seeking a highly skilled Machine Learning Operations (ML Ops) Engineer to join our team.
Key Responsibilities:
1. Design and Develop Infrastructure: Design and develop infrastructure for our machine learning services, including planning, executing, and testing disaster recovery processes.
2. Manage Resource Allocation: Allocate resources and schedule workloads across multiple ML services, ensuring efficient use of CPU/GPU resources and building reliable queues based on service priorities.
3. Maintain VM Environments: Maintain VM environments, manage OS updates, and keep the VM inventory up to date.
4. Troubleshoot System Configurations: Diagnose system configuration issues and provide solutions.
5. Implement CI/CD Pipelines: Implement and manage CI/CD pipelines to ensure seamless and efficient deployment of ML models.
6. Collaborate with Data Scientists: Collaborate with data scientists, ML researchers, and language experts to understand the requirements for deploying ML models and provide necessary infrastructure support.
7. Automate Build, Test, and Deployment Processes: Automate and streamline the build, test, and deployment processes to enhance efficiency and reduce time-to-market.
-----------------------------------
Required Skills and Qualifications:
* Strong knowledge of cloud platforms (such as AWS, Azure, or GCP) and local cluster deployments.
* Knowledge of distributed computing frameworks (e.g., Spark) and big data technologies (e.g., Hadoop, Kafka).
* Proficiency in Python, Shell, Ruby, Golang, or C++, and experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation).
* Hands-on experience with containerization technologies (e.g., Docker) and orchestration frameworks (e.g., Kubernetes).
* Familiarity with CI/CD tools (e.g., Jenkins, GitLab CI/CD) and version control systems (e.g., Git).
* Solid understanding of networking, security, and system administration concepts.
* Strong problem-solving and troubleshooting skills, with the ability to quickly analyze and resolve issues in complex ML systems.
* Excellent communication and collaboration skills, with the ability to work effectively in a team-oriented environment.
* Bachelor's or higher degree in Computer Science, Engineering, or a related field.
-----------------------------------
Desired Skills and Experience:
* Experience with machine learning frameworks and libraries, such as TensorFlow, PyTorch, or scikit-learn.
* Familiarity with serverless computing and event-driven architectures.
* Experience with logging and monitoring tools (e.g., ELK Stack, Prometheus, Grafana).