Netflix and Chaos Monkey: A DevOps Case Study
DevOps is a mindset that revolves around adapting processes and organizational structures to prioritize business value, essential software quality attributes, and continuous improvement. Although it is commonly associated with practices like Agile development, automation, and continuous delivery, the mindset applies well beyond any one of those practices.
Netflix is a notable DevOps case study: it shows a thorough understanding of DevOps principles and a commitment to quality attributes enforced through automated processes. DevOps advocates keep a sharp focus on quality attributes and use automation for consistency and efficiency in meeting business needs.
Netflix’s streaming service, which runs on Amazon Web Services (AWS), is a complex distributed system with numerous interconnected components. To ensure reliable video streaming across diverse devices, Netflix engineers concentrated on the quality attributes of reliability and robustness for both server- and client-side components. Recognizing that the best way to handle failure is to practice it, they embraced DevOps by automating failure.
Users of Netflix software may have noticed that the available video streams occasionally change, yet they do not experience crashes, errors, or performance degradation. This is attributed to ‘Chaos Monkey,’ a tool in Netflix’s Simian Army suite. Chaos Monkey is a script that runs continuously in all Netflix environments and randomly shuts down server instances. Because failures are introduced deliberately during development, developers constantly test their software under unexpected failure conditions, which pushes them to build fault-tolerant systems.
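The idea of randomly shutting down instances while the service keeps answering can be illustrated with a small in-memory simulation. This is only a sketch and not Netflix's implementation: the names `Instance`, `chaos_monkey`, and `serve` are hypothetical, nothing here talks to AWS, and the "servers" are plain Python objects.

```python
import random


class Instance:
    """A stand-in for one server instance (hypothetical, in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        # A terminated instance refuses every request.
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def chaos_monkey(instances, rng):
    """Terminate one randomly chosen live instance; return it, or None if none are left."""
    live = [i for i in instances if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def serve(instances, request):
    """Fault-tolerant caller: try each instance in turn until one answers."""
    for instance in instances:
        try:
            return instance.handle(request)
        except ConnectionError:
            continue  # fail over to the next instance
    raise RuntimeError("no healthy instances left")


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run is repeatable
    pool = [Instance(f"server-{n}") for n in range(3)]
    victim = chaos_monkey(pool, rng)
    print(f"terminated {victim.name}")
    print(serve(pool, "GET /stream"))
```

The point of the sketch is the division of labor: `chaos_monkey` knows nothing about the caller, so the only way `serve` keeps working is by being written to tolerate a dead instance in the first place.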
The use of Chaos Monkey not only provides a unique testing environment but also encourages developers to design modular, testable, and resilient systems from the outset. DevOps, exemplified by Netflix’s approach, involves altering the development process through automation to establish a system where the behavioral economics favor the production of high-quality software.
In a DevOps organization, leaders must consider how to incentivize desired outcomes and drive organizational change. Embracing DevOps requires a willingness to make necessary changes and sacrifices, including intentionally causing failures, to set the organization up for success. Netflix credits its ‘chaos testing’ approach with preparing its systems to handle the AWS reboot of 10% of servers on September 25, 2014 seamlessly. The success of this strategy led to the Simian Army, a suite of chaos-testing tools that is now available as open-source software.