Kubernetes Senior Reliability Engineer (3 positions)

España, Madrid
Costa Rica, Costa Rica


We are looking for Reliability Engineers for our Container-as-a-Service product team. Our team is mandated to provide central platforms, services and expertise on Kubernetes deployments for Science and Enterprise activities around the world at Roche. As for all reliability engineers, experience with Infrastructure-as-Code is essential; the specific skills needed in this team are Kubernetes/Containers and Public Cloud IaaS/PaaS deployments. Reliability Engineers work with a positive attitude in a self-managed team, helping teammates learn and develop. They act as ambassadors to our internal consumers, understanding business needs and translating them into infrastructure solutions.

Reliability Engineering (RE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. RE ensures that Roche’s services, both internally critical and externally visible, have reliability and uptime appropriate to users' needs and a fast rate of improvement. Additionally, REs keep an ever-watchful eye on system capacity and performance. Much of their work focuses on building infrastructure, optimizing existing systems and eliminating manual work through automation.

A Reliability Engineer is an infrastructure engineer who knows how to apply engineering principles to operations. They are well versed in a large number of technologies and welcome new tools and techniques. They work with fellow engineers and operations staff to reach the best possible solution. They are always looking for patterns and ways to increase efficiency, eliminate downtime, optimize costs, and maintain performance at scale. They also advise our consumers on the RE value proposition, adoption, industry best practices, and implementation strategy.

REs are responsible for the big picture of how the systems relate to each other, using a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting and dynamic day-to-day work.

RE teams will have the opportunity to manage the complex challenges of scale which are unique to Roche, while using expertise in coding, algorithms, complexity analysis and large-scale system design.

RE's culture of diversity, intellectual curiosity, problem solving and openness is key to its success. It brings together people with a wide variety of backgrounds, experiences and perspectives, who are encouraged to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while striving to create an environment that provides the support and mentorship needed to learn and grow. REs provide on-call support to keep systems up and running, ensuring that consumers have the best and fastest experience possible.

Job Responsibilities

  • Contribute to activities focused on availability, tuning, performance, efficiency, change management, monitoring, emergency response and capacity planning.

  • Engage in and improve, under guidance, the whole lifecycle of services—from inception and design through deployment, operation and refinement.

  • Collaborate in the creation of a bridge between engineering and operations by applying a software engineering mindset to system administration topics.

  • Monitor and resolve incidents and problems with platform operations, suggesting priorities and collaborating on the resolution when required.

  • Help support services before they go live through activities such as infrastructure design consulting, developing software platforms and frameworks, capacity planning and launch reviews.

  • Collaborate with Managed Services suppliers and external consultancies, ensuring the collaboration is as effective as possible.

  • Scale systems sustainably through mechanisms like automation, and evolve systems by proposing changes that improve reliability and velocity.

  • Contribute to the maintenance of services once they are live by measuring and monitoring availability, latency and overall system health.

  • Look for continuous improvement opportunities in technical, teamwork, collaboration and process areas, and propose and contribute to the resulting activities.

  • Act as an analyst by translating customer needs into specific requirements to be implemented in components managed by the team or by other teams.

  • Remain proactive and aware of operational challenges and opportunities and work with support team staff to resolve incidents and major incidents.

  • Ensure implemented solutions and components comply with Quality/Regulatory standards, as applicable.

Job Requirements / Qualifications

  • Good interpersonal skills.

  • Demonstrated customer & delivery focus.

  • Proven scripting and automation skills, with strong knowledge of delivering and managing infrastructure as code.

  • Ability to work effectively with team members and virtual teams from different locations and different cultural backgrounds.

  • Ability to work independently with minimal supervision and to navigate ambiguity.

  • Strong problem-solving and decision-making skills.

  • Good oral and written communication skills in English. German, Spanish or Chinese (Mandarin) are significant pluses.

  • Willingness to travel moderately (20%) and ability to work across multiple time zones, including on-call duty.

Education / Years of Experience

4-7 years of relevant work experience

or 2-5 years with a Bachelor’s degree

or 1-3 years with a Master’s degree

At least 1 year of experience working in one or more multinational environments as a senior systems or software engineer; healthcare industry experience is a plus.

Technology Skills

  • Hands-on technical skills in automation, infrastructure as code, logging, monitoring and observability, infrastructure configuration, scripting languages and applications. 

  • Knowledge of infrastructure system internals, administration and networking.

  • Knowledge of applying design thinking, lean, prioritization and agile methodologies to evolve services offered to partners.

  • Knowledge of defining technical computing infrastructure that is entirely under software control, with no operator or other human intervention.

  • Knowledge of defining Service Level Objectives and Service Level Indicators.

  • Knowledge of the DevOps mindset, processes and tools.

  • Cross-functional technical knowledge of tools, scripting and methodologies for: configuration management, Infrastructure as Code, automation design, the infrastructure development life cycle and hybrid clouds.

  • Knowledge of algorithms, data structures, complexity analysis and software design.

At Roche, we believe in diversity and inclusion as essential values for our success. We have a special interest in integrating people with disabilities into our teams. If you have a disability, we consider it a plus and offer special benefits for you: go ahead and submit your application!