DevOps: Year One

Magnus Hedemark

I had the pleasure of addressing the Triangle DevOps meetup group on the subject of DevOps: Year One. The target audience was anyone who has bought into the idea of DevOps transformation for their business, but wanted practical advice for how to get started.

With only an hour to speak, and so many great questions to answer, we barely got to scratch the surface. But we did get to talk about some specific practices that have helped the Systems Engineering team at Bronto to work much more effectively.

This was the first time we’ve tried live streaming a DevOps meetup, and while the A/V quality leaves a lot to be desired this time, we’re going to get better at this. The slides aren’t visible in the video itself, but I’ve made them available.

The talk itself was based on the prevailing wisdom of the DevOps community, as well as my own successes (and failures) in fomenting DevOps transformation at Bronto and beyond.

I led right in with the subject of failure. Our attitude toward failure in a DevOps culture must be one of scientific curiosity. Each failure is an incredible opportunity to learn about the fragility of our systems and our processes, and to shore those things up in a planned fashion rather than during a real crisis (hopefully). The intimidating search for a “root cause” (often a human being) is giving way to something more blameless and sincerely inquisitive, where people feel safe talking about their contributions to the failure because 1) they’ve learned something valuable and 2) they know they won’t be penalized for it. And we’ve also learned that when things break, there usually isn’t any one specific cause, but a number of contributing factors to reveal.

At a very high level, we were able to cover some of the organizational alignment gotchas, and how to approach the transformation from a cultural leadership perspective. We talked about how much easier these things can be when you’ve got executive support, not necessarily from a command and control perspective, but just from the point of influence and vision.

We also got to talk a bit about how my own approach to DevOps transformation borrows heavily from the Toyota Production System and the Toyota Way, as well as from the Agile values and principles. We talked a bit about how getting lined up on values and principles is so much more important than the technology or the process, because when you’re making tough decisions about technology and process, the values and principles will serve as a compass so you don’t get lost.

Time was super-tight, but I introduced the idea of one-piece flow and of work cells (which we call “matrix teams”, “cross-functional teams”, or “full stack” teams in modern software parlance). We even had a neat exercise after the talk to bridge the divide between a number of functional teams and reassemble people into a cross-functional work cell, eliminating costly handoffs and getting our new features in front of customers more expediently.

We did talk a good bit about the value of retrospective meetings, how they can be structured, and how to get the most value out of them. Based on the number and quality of questions I was getting both during and after the talk on this subject, I think it’s something that people are hungry to learn more about. You might want to check out Agile Retrospectives: Making Good Teams Great.

A good amount of time was spent in dialogue about workload leveling, or heijunka. Sadly, this is one of those many subjects where a few minutes just doesn’t do it justice. We might have to revisit this.

There was a lot of interest in Kanban, so some of us hung around after the talk and actually collaboratively constructed a one piece flow for Kanban cards. I think there’s a lot of hunger in the community to learn more about how to do this, and also for software that serves this workflow better. There’s probably an opportunity in the marketplace for someone who really gets it with Kanban to give us the software we need. And while a physical board is a simple and easy way to do it, the reality of the software world is that we leverage technologies to have geographically diverse teams that need their sources of truth to be easily available and accessible online.
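The core mechanic we assembled after the talk, pulling cards through columns whose work-in-progress (WIP) limits force one-piece flow, can be sketched in a few lines of Python. This is an illustrative toy, not Bronto's actual tooling; the class names and cards are made up:

```python
from dataclasses import dataclass, field

@dataclass
class KanbanColumn:
    """One stage of the workflow, with a work-in-progress (WIP) limit."""
    name: str
    wip_limit: int
    cards: list = field(default_factory=list)

class KanbanBoard:
    """Toy board enforcing WIP limits; a limit of 1 on the working
    column approximates one-piece flow."""
    def __init__(self, columns):
        self.columns = columns

    def pull(self, from_idx, card):
        """Pull a card into the next column only if it has capacity."""
        src, dest = self.columns[from_idx], self.columns[from_idx + 1]
        if len(dest.cards) >= dest.wip_limit:
            # Downstream is full: work queues upstream instead of piling
            # into "in progress", which makes the bottleneck visible.
            return False
        src.cards.remove(card)
        dest.cards.append(card)
        return True

board = KanbanBoard([
    KanbanColumn("To Do", wip_limit=5, cards=["feature-A", "feature-B"]),
    KanbanColumn("In Progress", wip_limit=1),  # one piece at a time
    KanbanColumn("Done", wip_limit=999),
])

board.pull(0, "feature-A")            # succeeds: In Progress was empty
blocked = board.pull(0, "feature-B")  # refused: WIP limit of 1 reached
```

A physical board enforces this with sticky notes and column headings; the point of the sketch is just that the WIP limit, not any ceremony, is what prevents batching.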

I talked a bit about how we’re measuring success using our Kanban one-piece flow at Bronto, and how the one-piece flow made our waste easier to identify and eliminate. Of course, we looked at real-world measurements of improvement in a team following these practices, and identified other areas for kaizen (continuous improvement). My team at Bronto was always effective, always getting our critical work done on time, but there was room for doing better than that, for holding ourselves to a higher standard. I’m particularly proud of how much more consistent our team’s performance has become since we started instituting and improving on better planning and work practices, borrowed heavily from Lean and Agile influences.

[Figure: Typical performance chart used to measure improvement.]

Out of all of the things we did get to cover, as well as the few things we didn’t get to cover much, the one thing that really seemed to resonate the most was probably the topic of retrospectives, or hansei-kai. The SysEng team at Bronto has been learning together to use deep and honest retrospective meetings to bring individual learning to the surface, and together turn it into organizational learning.

What’s the objective of the retrospective (or retro)? Simply put, we reflect on where we’ve been since the last retrospective. We acknowledge the things we’ve changed that have worked out well for us that we wish to continue (or fine-tune). We raise attention to the things that we could be doing better at, and come up with specific action items for improvement before the next retro. We also reflect on the action items from the previous retro, and ask ourselves how well we achieved those objectives.

These retrospective meetings can be very raw. What we’re doing here is using surface symptoms as a way in to the deeper opportunities for improvement behind them. That usually involves some honest and introspective dialogue, so there must be a mutual feeling of trust that individual weaknesses won’t be exploited. To that end, we keep meeting attendance extremely limited. One of the changes we made to improve honesty was actually to un-invite the manager from the room so that individuals would be more at ease expressing their concerns. We then invite him back in at the end of the meeting to discuss the collective action items. All of the raw inputs to those action items, the sometimes sensitive individual disclosures, remain confidential.

In terms of meeting conduct, the retrospective must be honest and respectful. One of the things that was a little hard to institute at first was a “no devices” rule. Whatever you’re doing on the computer can wait. The phone can wait. Your eyes and mind and attention are here. Your team needs you, all of your attention, to get better together. But by instituting this rule, we’ve realized better engagement in the retrospectives, and thus more impactful improvements come out of them.

How often should retrospectives be held? Well, practitioners of different agile practices might do them weekly or biweekly. We’ve been trying them monthly, which is probably as far apart as you can space them without losing a lot of their value to faded memories of pain from early in the cycle. Additionally, when you hold them farther apart, the meetings themselves get longer; a two-hour retro is not at all unusual if there is a hot topic keeping the conversation lively. So we’re probably going to try holding these more frequently, biweekly perhaps, and for shorter durations to cut down on meeting fatigue. There’s no hard-and-fast rule here. In everything we do, from an educated position we try something, we learn from that attempt, and we improve and iterate on it. We do that within the retrospective. We even do that with the retrospective itself.

Does it work? I believe the answer is a quantified and unqualified yes. I know this because we’ve been measuring things that matter, like the lead time for getting new work done for our customers… and how consistent we are at hitting that number. How well are we leveling our workload? Once we start something, how good are we at finishing it without getting distracted? Is our quality improving? I can see, without a doubt, that our changes have led us to a place where I know how long it’s going to take to get something done if it takes a normal path in our work pipeline, and how long it will probably take if it’s prioritized. We didn’t really have that sort of data-driven insight before, but this path has given us more and more tooling to use the scientific method to make more effective decisions with a higher confidence level in their outcomes. That’s something that will only continue to ensure that we’re the best marketing platform for commerce out there.
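The two numbers mentioned above, how long work takes and how consistent that figure is, fall directly out of the start/finish timestamps on completed Kanban cards. A minimal sketch (the timestamps here are invented for illustration, not Bronto’s real data):

```python
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical (started, finished) timestamps for completed cards.
cards = [
    (datetime(2014, 9, 1), datetime(2014, 9, 4)),
    (datetime(2014, 9, 2), datetime(2014, 9, 5)),
    (datetime(2014, 9, 5), datetime(2014, 9, 11)),
    (datetime(2014, 9, 8), datetime(2014, 9, 12)),
]

# Lead time per card, in days.
lead_times = [(done - start).days for start, done in cards]

# The mean tells you how long work typically takes; the standard
# deviation tells you how *consistent* that performance is. A falling
# standard deviation over time means predictability is improving.
print("mean lead time (days):", mean(lead_times))
print("std dev (days):", round(pstdev(lead_times), 2))
```

Tracking the same pair of numbers cycle after cycle is what turns “I think we’re getting better” into the kind of data-driven insight described above.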