Whiteboard Confessional: The Bootstrapping Problem
About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links: CHAOSSEARCH, @QuinnyPig

Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name for your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

Corey: This episode is brought to you by Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible all-in-one solution that protects your workflows and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. Whew.
At trendmicro.com/screaming.

Hello, and welcome to this edition of the AWS Morning Brief: Whiteboard Confessional, where we confess our various architectural sins that we and others have committed. Today, we're going to talk about, once upon a time, me taking a job at a web hosting provider. It was the thing to do at the time because AWS hadn't eaten the entire world yet, therefore, everything that we talk about today was still a little far in the future. So, it was a more reasonable approach, especially for those with, you know, budgets that didn't stretch to infinity, or no willingness to be an early adopter of someone else's hosting nonsense, to go ahead and build out something in a data center. Now, they were obviously themselves not hosting on top of a cloud provider because the economics made less than no sense back then. So, instead, they had multiple data centers built out that provided for customers' various hosting needs. Each one of these was relatively self-contained unless customers wound up building something themselves for failover. So, it wasn't really highly available so much as it was a bunch of different single points of failure, and an outage of one would impact some subset of their customers, but not all of them. And that was a fairly reasonable approach provided that you communicate that scenario to your customers because that's an awful surprise to have later in time.

Now, I was brought in as someone who had had some experience in the industry, unlike many of my colleagues who had come from the hosting provider’s support floor and been promoted into systems engineering roles. So, I was there to be the voice of industry best practices, which is a terrifying concept when you realize that I was nowhere near as empathetic or aware back then as I am now, but you get what you pay for.
And my role was to apply all of those different best practices that I had observed, osmosed, and bluffed, to what this company was doing, and see how it fit in a way that was responsible, engaging, and possibly entertaining.

So, relatively early on in my tenure, I was taking a tour of one of our local data centers and was asked what I thought could be improved. Now, as a sidebar, I want to point out that you can always start looking at things and pointing out how terrible they are, but let's not kid ourselves; we very much don't want to do that because there are constraints that shape everything that we do and we aren't always aware of them. So, making people feel bad for their choices is never a great approach if you want to stick around very long. So, instead, I started from the very beginning, and played, “Hi. I'm going to ask the dumb questions, and see where the answers lead me to.”

So, I started off with, “Great, scenario time. The power has just gone out. So, everything's dark, now how do we restart the entire environment?” And the response was, “Oh, that would never happen.” And to be clear, that's the equivalent of standing on top of a mountain during a thunderstorm, cursing God while waving a metal rake into the sky. After you say something like that there is no disaster that is likelier. But all right, let's defuse that. “Humor me. Where's the runbook?” And the answer is, “Oh, it lives in Confluence,” which is Atlassian’s wiki offering. For those who aren't aware, wikis in general, and Confluence in particular, are where documentation and processes go to die. “These are living documents,” is a lie that everyone says because that's not how it actually works.

“Cool. Okay, so let's pretend that a single server instead of your whole data center explodes and melts. When everything's been powered off, you turn it back on. That one doesn't survive the inrush current, and that one server explodes. That server happens to be the Confluence server. Now what?
How do we bootstrap the entire environment?” The answer was, “Okay, we started printing out that runbook and keeping it inside each data center,” which was a way better option. Now, the trick was to make sure that you revisited this every so often, when something changed, and make sure that you weren't looking at how things were circa five years ago, but that's a separate problem. And this is fundamentally a microcosm of what I've started to think of as the bootstrapping problem. I'll talk to you a little bit more about what those look like in the context of my data center atrocities. But first:

This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. If you're looking to either process multiple terabytes in a petabyte-scale of data a day or a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to say, “Yeah, 70 to 80 percent on the infrastructure costs,” which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.

Now, let's talk about some of these bootstrapping atrocities. Pretend that you have a fleet of physical machines all running virtual machines, and your DNS servers, two in every environment as a minimum, live inside of a VM for easy portability. Great. So, that makes sense; your primary and your secondary wind up being in virtual machines, you can migrate them everywhere. What happens if they migrate onto the same physical server?
You have now taken a redundancy and shortcutted it back to a single server going down, causing problems for absolutely everything. If it dies, then there's no DNS in your data center, and everything else will rapidly stop working.

Let's take it a step further. Assume a full site-wide power outage there. If your physical servers need DNS to work in order to boot all the way into a state where they can launch virtual machines successfully and there are no DNS servers available, now what? Well, now you're in trouble. Maybe the answer is to remove that DNS dependency on getting virtual machines up and running. Maybe it's to make the DNS servers physical nodes that each live in different racks and that's sort of an exception to your virtual machine approach. There are a lot of options you can take here, but being aware that the problem and failure mode exist is where you have to start. Without that, you won't plan around it.

Another example is a storage area network, or SAN. These things are usually highly redundant, incredibly expensive, and are designed to be always available. But if they're not, due to misconfiguration, power issues, someone unplugging the wrong thing, et cetera, suddenly, you've put an awful lot of eggs into one incredibly expensive basket. How do you recover from a SAN outage? Have you ever tried it? Do any of the parts that you're going to need to recover from that outage require, you know, that SAN to be available, or things that are on that SAN to be available? How things break, and what those failure modes look like, are problems you need to be aware of.

And this carries forward into the time of Cloud. If you wind up having a DR plan where you're going to fail over from us-east-1 to us-east-2 in AWS land (so from Virginia to Ohio), great. It works super well when you do a DR exercise. But when there's an actual outage in us-east-1, you're not the only person with that plan. So, suddenly, there's control plane congestion, which leads to incredible latency.
It may take 15 to 20 minutes for an instance to finish coming up, where before it took 40 seconds. It's the herd of elephants problem that you're never going to surface in a mock DR test; it only tends to manifest when you start seeing everyone doing the same thing. The way you get around this is, well, there are a few ways, but one is to have those systems provisioned and up and running even though you don't need them right now. This is sort of a trade-off between what makes sense economically versus what your durability is going to look like. It's a spectrum, and you have to figure out what makes sense. A less expensive approach might be to go in the opposite direction. If you're planning for half the internet to fail from Virginia over to Ohio, maybe you prepare for the opposite: have everything running steady-state in Ohio, and then when that goes down, you spin up things in Virginia or Oregon. It's still going to have that herd of elephants problem, but it's going to be less common than the other direction.

One last single point of failure that I want to highlight is the company credit card for many companies. If your AWS payment fails to go through, what then? Most people have not configured a secondary or tertiary method of payment, and then it becomes a serious problem. If you're not big enough to have your own account manager and a working relationship with that person, you can wind up with resources suspended if you miss some increasingly frantic emails. (Spoiler, by the way: every account has an account manager. Most don't know it.) This gets closer into something else that we're going to talk about next week, namely, the underpants problem.

This has been another episode of the AWS Morning Brief: Whiteboard Confessional. I am Cloud Economist Corey Quinn, fixing your AWS bills and your AWS architectures at the same time.
If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you've hated it, please leave a five-star review on Apple Podcasts, and tell me what single points of failure I have failed to consider.

Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.

Announcer: This has been a HumblePod production. Stay humble.
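
Show-notes sketch, not part of the episode audio: the DNS and SAN scenarios in the transcript are circular dependencies. One way to surface them before a site-wide power outage does is to write the cold-start dependencies down as a graph and check whether any power-on order exists at all. The Python below is a minimal illustration using the standard library's `graphlib` (Python 3.9 or later). The component names and both dependency maps are hypothetical, invented for this example rather than taken from the hosting provider in the story.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical cold-start map: each key cannot start until everything in its
# set is already running. These names are illustrative assumptions only.
VIRTUAL_DNS = {
    "hypervisor-hosts": {"dns"},               # hosts need DNS to finish booting
    "dns": {"hypervisor-hosts"},               # but DNS runs as VMs on those hosts
    "confluence-runbook": {"hypervisor-hosts", "dns"},
}

# Same environment after one of the fixes mentioned in the episode:
# DNS moved to physical nodes that depend on nothing else.
PHYSICAL_DNS = {
    "dns": set(),
    "hypervisor-hosts": {"dns"},
    "confluence-runbook": {"hypervisor-hosts", "dns"},
}


def cold_start_order(deps):
    """Return one valid power-on order; raise CycleError if none exists."""
    return list(TopologicalSorter(deps).static_order())


def find_bootstrap_cycle(deps):
    """Return the nodes forming a dependency cycle, or None if bootable."""
    try:
        cold_start_order(deps)
    except CycleError as err:
        # graphlib puts the detected cycle in the second element of args;
        # the first and last node in that list are the same.
        return err.args[1]
    return None


if __name__ == "__main__":
    print("virtual DNS cycle:", find_bootstrap_cycle(VIRTUAL_DNS))
    print("physical DNS order:", cold_start_order(PHYSICAL_DNS))
```

A check like this only catches the dependencies someone remembered to write down, so it complements the printed runbook and an actual power-off test rather than replacing them.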