Whiteboard Confessional: Configuration MisManagement
About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Show Notes

CHAOSSEARCH.io
Twitter: https://twitter.com/QuinnyPig

Transcript

Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name for your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure.
Check them out today at CHAOSSEARCH.io.

Historically, many best practices were, in fact, best practices. But over time, the way that we engage with systems changes. The problems that we’re trying to solve for start resembling other problems. And, at some point, entire industries shift. So, what you should have been doing five years ago is not necessarily what you should be doing today.

Today, I’d like to talk a little bit about not one or two edge case problems, as I have in previous editions of the Whiteboard Confessional, but rather, I want to talk about an overall pattern that’s shifted. And that shift has been surprisingly sudden, yet gradual enough that you may not entirely have noticed. This goes back into, let’s say 2012, 2013, and is in some ways the story of how I learned to speak publicly. So this is indirectly one of the origin stories of me as a podcaster, and continuing to engage my ongoing love affair with the sound of my own voice.

I was one of the very early developers behind SaltStack. Salt, for those who are unfamiliar, is a remote execution framework slash configuration management system that let me participate in code development. It turns out that when you have a pattern of merging every random pull request that some jackass winds up submitting, and then immediately submitting a follow-up pull request that fixes everything you just merged in, it’s, first, not the most scalable thing in the world, but on balance provides such a wonderful, welcoming community that people become addicted to participating in it. And SaltStack nailed this in the early days.

Now, before this, I’d been focusing on configuration management in a variety of different ways. Some of the very early answers for this were CFEngine, which was written by an academic and is exactly what you would expect an academic to write. It feels more theoretical than it does something that you would want to use in production.
But okay, Bcfg2 was something else in this space, and the fact that that is its name tells you everything you need to know about how user-friendly that was.

And then the world shifted. We saw Puppet and Chef both arise. You can argue which came first, I don’t care enough in 2020 to have that conversation. But they wound up promoting a model of a domain-specific language, in Puppet’s case, versus Chef, where they decided, “All right, great, we’re gonna build this stuff out in Ruby.” From there, we then saw a further evolution of Ansible and SaltStack, which really round out the top four.

Now, all of these systems fundamentally do the same thing, which is how do we wind up making the current state of a given system look like it should? That means, how do you make sure that certain packages are installed across all of your fleet? How do you make sure that the right users exist across your entire fleet? How do you guarantee that there are files in place that have the right contents? And when the contents of those files change, how do you restart services? Effectively, how do you run arbitrary commands and converge the state of a remote system so it looks like it should?

Because trying to manage systems at scale is awful. You heard in a previous week what happened when I tried to run this sort of system by using a distributed SSH client. Yeah, it turns out that mistakes are huge and hard to fix. This speaks toward the direction of moving into cattle instead of pets when it comes to managing systems. And all of these systems more or less took a different approach to it. And some were more aligned with how I saw the world than others were.

So I started speaking about SaltStack back in 2012 and giving conference talks. The secret to giving a good conference talk, of course, is to give a whole bunch of really terrible ones first, and woo boy were these awful. I would put documentation on the slides.
I would then read the documentation to people, frantically trying to teach folks the ins and outs of a technical system in 45 minutes or less. It was about as engaging as it probably sounds. Over time, I learned not to do that, but because no one else was speaking about SaltStack, I was sort of in a rarefied position of being able to tell a story, and learn to tell stories, about a platform that I was passionate about, as it engaged a larger and larger community.

Now, why am I talking about all of this on the Whiteboard Confessional? Excellent question.

But first, in the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

The reason I bring up configuration management across the board is not because I want to talk about the pattern of doing terrible things within it, and oh, there are terrible patterns, but rather to talk about how in 2020, it is almost certainly an anti-pattern if you’re using it at all. Let me explain. The idea of configuration management was how you wind up getting a bunch of systems converged to a common state. Well, in the world of immutable infrastructure, in no small part ushered in by Docker, and later other systems like it, suddenly spinning up a new container took less than a second, and it was exactly as you’d want it to be.
So one-line configuration changes didn’t require redeploying fleets of servers, just iterating forward to the next version of the container. That, in turn, became a much better approach, and then you could get into a world where there wasn’t even an SSH daemon running, so people could not get into these containers in the first place to muck around and break the config, which was a big piece of why you had to have these things running continuously. And over time, as we saw this pattern continue to emerge, the number of systems and use cases for which having something stateful that was managing this became fewer and fewer.

Now, with the rise of cloud providers being what they are, you wind up having a slightly different problem, where, if I’m trying to provision a whole bunch of EC2 instances to do things, well, first, I can change the AMI, and we’ve been able to do that for 15 years, but config management was still a great plan back then. Now with things like auto-scaling groups, with containers running on top of these systems, you’re getting into a place where you’re using something like Chef for initial configuration of an instance, maybe, at best, but you’re not going to be running this in a persistent way, where it’s constantly checking in with a central control system anymore.

In fact, an awful lot of the things that Chef historically would do are now being handled by user data, where you give a system a script to run on the first time that it boots. Once that script completes successfully, it reports into a load balancer or whatnot as being ready to go. But even that is something of an anti-pattern now because historically, we had the problem of, “Okay, I’m going to have my golden image, my AMI that spins up everything that I’d want, then I’m going to hand off to Chef.” So you would see these instances being spun up, and then it would take 20 minutes or so for it to run through all the various Chef recipes and pass the health checks.
That’s great, but if you need that instance to be ready sooner than that, that’s not really a great pattern. That, in turn, has led to people who are specialized in the configuration management space feeling the ground beneath their feet start to erode. If you look at any seriously scaled out system, there’s a bit of that in there, but most of the stuff that matters, most of the things that break systems at scale, are being run through things like Terraform instead, where you’re provisioning exactly what needs to be there up front, letting your cloud provider get the right AMI or baseline image into production, and from there, everything else being handled by things like etcd, or by which version of a container they’re running to service a particular service.

Now, I don’t agree with all of the steps that people have taken down this path, in fact, I make fun of an awful lot of them on this and other shows, but the problem that we’re seeing that these things solve has not gone away, but it’s become much smaller, which means that, eventually, if you’re specializing in configuration management, as I once did, you’re going to take a look around and you’re going to realize that this is not a growth industry in any meaningful sense for these vendors who generally tend to charge on a per-host basis. Sure, there are existing large-scale environments that are continuing to expand their usage of these services, but there isn’t nearly as much net new going on there, which means that we’re starting to see what looks like an industry in decline.

One of my common career tips is to imagine that I’m going to start a company tomorrow. And in the next 10 years, it’s going to grow from a tiny startup to a component of the S&P 500. From a tiny company to a giant company. At what point during that evolution over the next 10 years does that tiny startup become a customer of a particular tool or a particular market segment?
If the answer is, it probably doesn’t, then whatever is serving that industry niche is almost certainly not going to survive longer term in its current state. Sure, there’s a giant long tail of existing big-E Enterprise offerings out there that are going to address that in a bunch of interesting and exciting ways, but that’s not something that I would necessarily hang my hat on as a technologist trying to focus on something that is going to be relevant for the next five to 10 years.

I, instead, would prefer to focus on something that’s growing. That means today, cloud services? Absolutely. Serverless? Most definitely. Kubernetes? Unfortunately. Docker? Well, sort of. That’s largely not become its own independent skill set anymore. It’s started to slip below the surface of awareness of what people care about, in the same way as almost everything else we talk about on this show eventually will as well, leaving it in the realm of a few highly advanced specialists to wind up playing around with them.

And that, in short form, is what happened to my once passionate love affair with configuration management, now turning into something I basically don’t even bother to ask people about, because I can relatively safely assume that the answer is not nearly as relevant as it once was to their technical success.

For the Whiteboard Confessional sub-series of the AWS Morning Brief, I’m Cloud Economist Corey Quinn, and I’ll talk to you next week.

Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig, and let me know what I should talk about next time.

Announcer: This has been a HumblePod production. Stay humble.