DevOps Support Engineer
100% Remote – Western Europe/Portugal/UK
12+ Month Long-Term Contract
DevOps Team Service Reliability Engineer
I. Description of Services and Milestones
The Service Reliability Engineer (SRE) consultant will provide day-to-day support, monitoring, troubleshooting, and fixing of issues to ensure the reliability and performance of the MD3 infrastructure. They will work alongside and under the technical direction of Lilly staff and will be located in the Eastern time zone.
Scope of responsibilities:
Monitoring and Support:
* Continuously monitor the health and performance of the MD3 infrastructure (data observations, HPC, LiveDesign tasks)
* Use monitoring tools (ServiceNow, Splunk, Grafana) to detect and respond to incidents in real time.
* Perform regular job queue checks and maintenance activities to ensure optimal performance.
* Monitor the MD3 dashboard and community chats/channels for any issues or alerts.
Troubleshooting and Fixing:
* Diagnose, troubleshoot, and potentially resolve technical issues related to the MD3 infrastructure.
* Collaborate with DevOps engineers and other technical teams to address and fix incidents.
* Document and communicate the root cause of incidents and the steps taken to resolve them.
Automation and Improvement:
* Develop and implement automation scripts to streamline monitoring and troubleshooting processes.
* Identify areas for improvement in the infrastructure and propose solutions to enhance reliability and performance.
* Participate in post-incident reviews to identify and address any gaps in the monitoring and support processes.
Collaboration and Communication:
* Work closely with the DevOps team to ensure alignment with business goals and research needs.
* Communicate effectively with stakeholders to provide updates on incidents and resolutions.
* Participate in regular standups and scrums to discuss ongoing issues and progress.
* Build and share bi-weekly reports on the status and performance of the MD3 infrastructure.
Knowledge Management:
* Develop and maintain knowledge articles for the help desk (ServiceNow) and an FAQ for users.
* Ensure that all documentation is up to date and easily accessible for the support team and end users.
Service Level Agreements (SLAs):
* Identify and establish SLAs for incidents and problems based on current ITSM practices.
* Ensure that all incidents and problems are resolved within the defined SLAs.
Performance and Infrastructure Capacity Planning:
* Performance Optimization: Fine-tuning applications and infrastructure to ensure systems meet performance benchmarks.
* Capacity Planning: Anticipating growth needs to scale infrastructure and prevent overutilization or underutilization of resources.
Documentation:
* Create runbooks for critical alerts.