As a core member of the Roche Science Infrastructure (RSI) team, the service reliability engineer will be responsible for developing and maintaining the interoperability and scalability of RSI services. Working closely with other members of the RSI team and with the Enterprise Operations and Engineering groups, the successful candidate will draw on their experience, knowledge and expertise to implement end-to-end continuity, functionality and load testing for global, integrated RSI services, which should be fault-tolerant and distributed and should provide a consistent and reliable user experience. In this role, the candidate will be responsible for understanding how services interact with and depend on one another, in order to engineer and automate away operational effort and complexity and to proactively identify and mitigate operational risks. The successful candidate will also provide expertise in the enhancement of current and future RSI services, along with guidance on best practices derived from the testing, scaling and monitoring capabilities they develop. Recognized as an expert in the field, with demonstrated complex problem-solving abilities, they will provide technical consultancy to other members of Infrastructure Services. Some experience in mentoring and leading others in small team environments is highly desirable. The position is global and may be placed in one of several geographic locations.
Main responsibilities are:
- Develop meaningful metrics for performance, availability, latency and efficiency to build an overall picture of service health through continuous testing & monitoring.
- Support frequent incremental service updates and improvements with meaningful tests of function, load, performance, and consistency of result/outcome.
- Contribute and work with other engineers to create and maintain reusable tools that are applicable across services.
- Maintain the current development/test environment for testing tools and enhancements, both on-prem and cloud-based.
- Work closely with other members of the RSI team to ensure that continuous monitoring and pre-emptive action are used optimally across the services offered by RSI (network, I&AM, tiered storage, containers, public cloud, etc.).
- Apply software engineering to remove toil through automation, looking to reduce operational burden, enhance consumer experience and apply continuous improvement.
The principal Services Reliability Engineer will be an experienced IT professional with a Bachelor’s degree (advanced degree preferred) in a relevant field of technology, science or business, possessing the following qualifications:
- 5-10 years of experience in providing scientific services monitoring & operations. Experience with advanced monitoring tools (e.g. Zabbix, XDMoD) highly desirable.
- In-depth knowledge of tool chains and analytics capabilities for operations.
- Experience with developing, deploying and using non-intrusive monitoring and anomaly detection agents and methods.
- Experience in one or more of the following programming languages: Go (golang), Python, C, C++.
- Machine learning experience in modeling operational performance highly desirable.
- Knowledge of configuration management tools and tool chains (e.g., Ansible, Jenkins).
- Expertise in debugging and optimizing systems, and automating routine tasks.
- Interest and expertise in designing, analyzing and troubleshooting distributed systems and APIs.
- Good organization skills to balance and prioritize work, and ability to multitask.
- Good communication skills for working with support personnel, customers, and managers.
- Experience in applying DevOps principles and culture highly desirable.