Rising From The Ashes, Untangling the Mysteries of DevOps

08.12.2019

Many books have influenced me as a software engineer: the computer science textbooks I used in college, the animal-covered O’Reilly books I used to pick up new languages and tools, and the Clean Code books from Uncle Bob Martin that helped shape the way I approach software development. However, no book has made me rethink the way I work more than The Phoenix Project, a novel by Gene Kim, Kevin Behr, and George Spafford.

In The Phoenix Project, the protagonist, Bill Palmer, is unwillingly promoted to VP of IT Operations. He has to save an organization that is dealing with out-of-control IT costs and system outages while releasing a major project that is behind schedule, over budget, and riddled with bugs. He realizes that continuing to run IT operations the way they always have been run would lead the company to bankruptcy. To solve these problems, Bill looks at how the company runs its manufacturing plants and the optimizations it learned from Lean manufacturing. By applying those lessons to the IT department, he discovers the processes that most of us have come to know as DevOps.

The Three Ways

As Bill looked down over the factory floor and thought about how to solve the company’s IT problems he began to understand “The Three Ways” and how he could leverage these ideas to save the business.

The First Way: Systems Thinking

Software development has traditionally been very siloed. Architects and analysts come up with a design and pass it to developers; developers implement the requirements and pass the result to QA; QA tests the software and passes it to operations; operations deploys and monitors it. The shift to agile in recent years has brought members of each of these teams together onto a single scrum team, but individual stories still tend to waterfall between team members as they go from development to testing to deployment.

Think of the Software Development Life Cycle (SDLC) as a factory floor. On one side of the factory you have raw materials coming in. This is your product backlog: the work that has yet to be done. On the other side of the factory you have finished goods being shipped out to distributors. This is your completed work that is deployed to production. On the factory floor you have various workstations where raw materials are turned into finished goods. These stations represent the various states of each story: development, review, QA, deployment, etc.

The customer only benefits from finished goods, from features that are done. Would you go to the store and buy something that is only 75% complete? Your business and customers benefit from getting features through the system quickly, and it hurts everyone when those features are stuck in the system–like materials sitting on the factory floor. The real killer of productivity is Work In Progress (WIP). WIP happens when new work is started before the existing work has been fully completed. This could mean stories that are blocked, waiting for QA or deployment. WIP means that you’re spending time and money and not producing any goods. WIP needs to be completed before new work is started.

There are a few techniques we can use to reduce WIP. As engineers, our favorite is automation. Automated tests are key for getting quick feedback and reducing manual testing time (look at the testing pyramid). Automated deployments are key as well, so that work can flow immediately into environments rather than waiting for someone to deploy it. But really the key to reducing WIP is team collaboration. When a story is in progress, all members of the team should be working to push the story forward, even if it means taking on a different role than they normally would (think of where the word “scrum” comes from). The team should continue to optimize this system to make it more efficient so that work flows through without getting stuck.
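One way to make the "finish existing work before starting new work" rule concrete is a per-stage WIP limit. The sketch below is only an illustration: the Board class, the stage names, and the limits are hypothetical, not something taken from The Phoenix Project or from any particular tracking tool.

```python
from __future__ import annotations


class WipLimitExceeded(Exception):
    """Raised when a stage is already at its WIP limit."""


class Board:
    """A toy story board that refuses new work once a stage is full.

    Hypothetical illustration: stage names and limits are assumptions.
    """

    def __init__(self, wip_limits: dict[str, int]) -> None:
        self.wip_limits = dict(wip_limits)
        self.stages: dict[str, list[str]] = {stage: [] for stage in wip_limits}

    def _check_capacity(self, stage: str) -> None:
        # A full stage blocks new work until something in it is finished.
        if len(self.stages[stage]) >= self.wip_limits[stage]:
            raise WipLimitExceeded(f"{stage!r} is at its WIP limit")

    def start(self, story: str, stage: str) -> None:
        """Pull a new story from the backlog into a stage."""
        self._check_capacity(stage)
        self.stages[stage].append(story)

    def move(self, story: str, src: str, dst: str) -> None:
        """Move a story downstream, but only if the next stage has room."""
        if story not in self.stages[src]:
            raise ValueError(f"{story!r} is not in {src!r}")
        self._check_capacity(dst)
        self.stages[src].remove(story)
        self.stages[dst].append(story)

    def finish(self, story: str, stage: str) -> None:
        """Mark a story as done, which frees capacity in its stage."""
        self.stages[stage].remove(story)
```

With limits such as {"development": 2, "qa": 1}, a third story cannot enter development and a second story cannot enter QA until something ahead of it is finished. That blocked move is the signal for the team to swarm on the stuck story rather than start something new.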

The Second Way: Continuous Feedback

Building on the efficiencies of the first way, the team must make feedback loops as short as possible. This includes feedback from customers and external stakeholders, as well as from internal members of the team. Everyone involved in the project should be communicating frequently through daily standups, frequent retrospectives, demos, etc. They should be functioning as a single team, not multiple teams with differing priorities. Feedback should be a natural part of the process. Changes should be analyzed, estimated, and scheduled rather than sending the team scrambling to get them implemented.

Internally, you can automate many of the feedback loops. Automated testing is key here: the idea is to fail quickly and get feedback as soon as possible. Automated deployments can also help spot issues early on, and monitoring of these deployments can reveal spikes in error rates and performance problems. Along the same lines, canary deployments are a useful tool: you open up new features to a subset of users in production and monitor for any issues that arise.
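As a rough sketch of the canary idea, the Python below assigns a stable subset of users to a new feature and compares the canary's error rate with a baseline. The function names, the hashing scheme, and the tolerance value are assumptions made for illustration; real deployment tools typically handle this at the load balancer or feature-flag layer.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls in the canary group for a feature.

    Hashing the feature name together with the user ID gives each user a
    stable bucket, so the same user sees the same behavior on every request.
    `percent` is a value from 0 to 100.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def canary_is_healthy(
    canary_errors: int,
    canary_requests: int,
    baseline_error_rate: float,
    tolerance: float = 0.01,  # assumed threshold, tune for your system
) -> bool:
    """Compare the canary's error rate against the baseline plus a margin."""
    if canary_requests == 0:
        return True  # no traffic yet, so nothing to judge
    return canary_errors / canary_requests <= baseline_error_rate + tolerance
```

A rollout could start with in_canary(user_id, "new-checkout", 5) and widen the percentage only while canary_is_healthy keeps returning True; "new-checkout" is a made-up feature name.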

The Third Way: Experimentation and Learning

Often as we rush to get software out the door we stop focusing on improving our processes, tools, and ourselves. The third way teaches us that we need to allocate time to experiment, and also reflect and learn from our mistakes.

One technique for experimentation is known as a “spike.” With a spike, the team creates a story to explore a new tool or process and time-boxes it by assigning the maximum number of story points they are willing to spend on the experiment.

Retrospectives and blameless postmortems are great ways to learn from mistakes that have been made. These only work if we allocate time to implement the changes that are suggested. A good quote to remember is: “The cost of failure is education” (from Devin Carraway). Continuing to do things the same way just because that’s the way they have always been done will hold the team back and prevent you from realizing the goals of the first and second ways.

Real World Examples of DevOps

Google

Google wrote the book on what they call Site Reliability Engineering. It’s been said that “class SRE implements DevOps.” Site Reliability Engineers (SREs) focus on improving the reliability of systems, as well as eliminating toil, reducing manual work in favor of automation, and managing overall risk.

An SRE must spend at least 50% of their time on engineering tasks that will reduce toil (the tedious manual, repetitive tasks). This means they focus most of their time on automating many aspects of the system, which in turn means they can improve how quickly new features can go to production, and they can scale the system without the need to scale the number of people running the system.

SREs also manage what they call an Error Budget. By looking at the difference between the expected uptime of the system and the actual uptime, they know how much risk they can tolerate. When there have been few errors and most of the budget remains, the team is free to experiment and push out riskier features. When the available error budget is low, it’s time to focus on paying back technical debt and improving reliability.
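The arithmetic behind an error budget is simple enough to sketch. The example below treats the budget as the downtime allowed by an uptime target over a period, and the remaining budget as what is left after observed downtime. The function names and the 25% reserve used to decide whether risky releases are acceptable are assumptions for illustration, not Google's policy.

```python
def error_budget_minutes(uptime_target: float, period_minutes: float) -> float:
    """Downtime allowed in the period by the uptime target (e.g. 0.999)."""
    if not 0.0 < uptime_target < 1.0:
        raise ValueError("uptime_target must be between 0 and 1")
    return (1.0 - uptime_target) * period_minutes


def remaining_budget_minutes(
    uptime_target: float, period_minutes: float, downtime_minutes: float
) -> float:
    """Budget left after subtracting the downtime actually observed."""
    return error_budget_minutes(uptime_target, period_minutes) - downtime_minutes


def can_take_risk(
    uptime_target: float,
    period_minutes: float,
    downtime_minutes: float,
    reserve_fraction: float = 0.25,  # assumed safety margin, not a standard
) -> bool:
    """True while more than the reserved share of the budget is unspent."""
    budget = error_budget_minutes(uptime_target, period_minutes)
    remaining = budget - downtime_minutes
    return remaining > reserve_fraction * budget
```

For a 99.9% target over a 30-day period (43,200 minutes), the budget works out to about 43.2 minutes of downtime. After 10 minutes of outages roughly 33.2 minutes remain, so under these assumptions the team still has room to ship risky changes; after 40 minutes it does not.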

Amazon Web Services (AWS)

AWS provides a definition of their recommended DevOps practices, and also has whitepapers with recommended best practices. They suggest processes such as Infrastructure As Code (IaC), Blue/Green deployments, and Continuous Integration / Continuous Deployment (CI/CD) in order to facilitate fast and reliable deployments.

Amazon famously implemented the “Two Pizza Team” rule, where teams must be small enough to be fed by two pizzas. They also promote the idea of “You build it, you run it,” which forces small teams to break down silos and focus on designing and building software that is production-ready, because those responsible for building the software are also responsible for keeping it running in production.

Netflix

Netflix also embraces the “Operate what you build” mentality and allows developers to do production deployments, but holds them responsible for keeping their own code running. They build a lot of specialized tooling to make this process faster, and have teams dedicated to creating build tools, automated pipelines, etc., so that teams are able to ship their code faster. They also pioneered the concept of Chaos Engineering, where their systems are built (and tested) to be resilient against unexpected outages.

Summary

What DevOps really comes down to is breaking down traditional silos to deliver features faster. This doesn’t necessarily mean rebranding the operations team or adding new tooling–rather a rethinking of the way that you build and deliver software. As Bill Palmer learned, look outside of the way that things have always been done and find what works best for your organization.