Whiteboard Confessional: Click Here to Break Production
About Corey QuinnOver the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.Links CHAOSSEARCH @QuinnyPig TranscriptCorey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real-world forces us to build, and that the best to call your staging environment is “theory”. Because invariably whatever you’ve built works in the theory, but not in production. Let’s get to it.On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.Today on the AWS Morning Brief: Whiteboard Confessional, I'm telling a different story than I normally do. Specifically, this is the tale of an outage from several weeks ago. The person who shared this story with me has requested to remain anonymous and further wishes me to not mention their company at all. This is, incidentally, a common occurrence. Folks don't generally want to jeopardize their relationship with AWS by disclosing a service issue they see, whereas I don't have that particular self-preservation instinct. Then again, I'm not a big AWS customer myself. I'm not contractually bound to AWS in any meaningful way, and I'm not an AWS partner, nor am I an AWS Hero. So, all that AWS really has over me in terms of leverage is the empty threat of taking away my birthday. So, let's dive into this anonymous story. It's a good one. A company was minding its own business, and then had a severity one incident. For those who aren't familiar with that particular designation, you can think of that as being the company's primary service fell over in an embarrassingly public way. Customers noticed, and everyone runs around screaming a whole lot. Now, if we skip past the delightful hair-on-fire diagnosis work, the behavior that was eventually tracked down was that an SNS topic had a critical listener get unsubscribed. That SNS topic invoked said listener, which in turn drove a critical webhook call via API gateway. This is a bad thing, obviously. Fundamentally, customers stopped receiving webhooks that they were expecting, and this caused a nuclear meltdown given the nature of what the company does, which I can't disclose and isn't particularly relevant anyway. But, for those who are not up to date on the latest AWS terminology, service names, and parlance, what this means at a high level is that a thing happens inside of AWS, and whenever that thing happens, it's supposed to fire off an event that notifies this company's paying customers. This broke because something somewhere unsubscribed the firing off dingus from the notification system. Now that we're aware of what caused the issue at a very high level, time to dig into how it happened and what to do about it. But first:In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast. The logs for who unsubscribed it are, of course, empty, which is a problem for this company’s blameless-in-theory-but-blame-you-all-the-way-out-of-the-company-if-it-turns-out-that-it-was-you-that-clicked-this-thing-and-didn't-tell-anyone, philosophy. CloudTrail doesn't log this event because why would it? CloudTrail’s primary purpose is to rack up bills and take the long way around before showing events in your account, not to assist with actual problem diagnosis, by all accounts. Now, fortunately, this customer did have AWS Enterprise Support. It exists for precisely this kind of problem. It granted them access to the SNS team which had considerably more insight into what the heck had happened, at which point the answer became depressingly clear, as well as clearly depressing. It turns out that the unsubscribe URL at the bottom of every SNS notification wasn't authenticated. Therefore, anyone who had access to the link could have invoked it, and that's what happened when a support person did something very reasonable: Copy and paste a log message containing that unsubscribe link into a team Slack channel. It wasn't their fault [00:06:04 unintelligible] because they didn't click it. The entity triggering this was—and I swear I'm not making this up—Slackbot. Have you ever noticed that when you paste a URL into Slack, it auto expands the link to show you a preview? It tries to do that on every URL, and you can't disable URL expansion at the -Slack workspace level. You can blacklist URLs but only if the link expansion succeeds. In this case, it doesn't have a preview, so it doesn't succeed, so there's nothing for it to blacklist. Slack’s helpful feature can't be disabled on a team-wide level, so when that unsubscribe URL shows up in a log snippet that got pasted, it silently unsubscribed the consumer from SNS and broke the entire system. Now, there are an awful lot of things that could have been different here. Isn't this the sort of thing that might be better off with SQS, you might reasonably ask? Well, four years ago, when this system was built, SQS itself could not, and did not support invoking Lambda functions, so SNS was the only real option. That's incidentally why it is so very important to have decent surface coverage when you launch an AWS service. Otherwise, stuff like this makes it into production and doesn't get revisited. And think about it for a minute. Why would it? If you have a thing that's working, why would it ever occur to you to go back and then have something like SQS or Kinesis invoke this system instead, now that that's supported because SNS is there? It's working. It's been working for four years. Until suddenly, today, it wasn't. People don't go back and adjust things without a compelling reason. It never makes the priority list until something like this happens. It's a common pattern. You can also get around this by mandating that an unsubscribe request in SNS be authenticated rather than supporting anonymous unsubscribe. But that's arcane knowledge the first time. It's commonplace only the second time you encounter this. For example, if you've listened this far, you now know it can be done. The parameter name, because you're going to need this, is AuthenticateOnUnsubscribe. And I'm sorry that you have to know that now. But again, this was working as it was for four years until something else external happened that impacted this. There's a lot of things that could have been done differently, and I'm not casting blame on anyone. But I will say—looking at this from my perspective—now that I've told this story, look at your own environment and think for a minute. What links get tossed into Slack teams at various times? Would any of them cause problems if an unauthenticated user were to click on them? Because, fundamentally, that's what's happening under the hood, courtesy of Slackbot. Maybe consider whether or not that's a good pattern, and do something about it before it ends in disaster that touches your customers. And maybe consider that the next time, before you go all-in on the current ChatOps hype nonsense. This has been the AWS Morning Brief: Whiteboard Confessional. I'm Cloud Economist Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. If you didn't enjoy this podcast, please leave a five-star review on Apple Podcasts anyway, and it's okay because you are probably Slackbot or work closely with Slackbot. Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on twitter at @quinnypig and let me know what I should talk about next time.Announcer: This has been a HumblePod production. Stay humble.