Whiteboard Confessional: The Curious Case of the 9,000% AWS Bill Increase

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

CHAOSSEARCH
@QuinnyPig
Chris Short’s LinkedIn: https://www.linkedin.com/in/thechrisshort/

Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

This episode is sponsored in part by ParkMyCloud, fellow worshipers at the altar of turn that [BLEEP] off. ParkMyCloud makes it easy for you to ensure you're using public cloud like the utility it's meant to be. Just like water and electricity, you pay for most cloud resources when they're turned on, whether or not you're using them. Just like water and electricity, keep them away from the other computers. Use ParkMyCloud to automatically identify and eliminate wasted cloud spend from idle, oversized, and unnecessary resources. It's easy to use and start reducing your cloud bills. Get started for free at parkmycloud.com/screaming.

When you're building on a given cloud provider, you're always going to have concerns. If you're building on top of Azure, for example, you're worried your licenses might lapse.
If you're building on top of GCP, you're terrified that they're going to deprecate all of GCP before you get your application out the door. If you're building on Oracle Cloud, you're terrified they'll figure out where you live and send a squadron of attorneys to sue you just on general principle. And if you build on AWS, you're constantly living in fear, at least in a personal account, that they're going to surprise you with a bill that's monstrous.

Today, I want to talk about a particular failure that a friend of this podcast named Chris Short experienced. Chris is not exactly a rank neophyte to the world of Cloud. He currently works at IBM Hat, which I'm told is the post-merger name. He was deep in the Ansible community. He's a Cloud Native Computing Foundation Ambassador, which means that every third word out of his mouth is now contractually obligated to be Kubernetes.

He was building out a static website hosting environment in his AWS account, and it was costing him between $10 and $30 a month. That is right in line with what mine tends to cost. And he wound up getting his bill at the end of the month: “Welcome to July, time to get your bill,” and it was a bit higher. Instead of $30, or even $40 a month, it was $2700. And now there was actual poop found in his pants.

This is a trivial amount of money to most companies; even a small company, and I say this from personal experience, runs on burning piles of money. However, a personal account is a very different thing. This is more than most people's mortgage payments if you don't make terrible decisions like I do, and live in San Francisco. This is an awful lot of money, and his immediate response was equivalent to mine. First, he opened a ticket with AWS support, which is an okay thing to do.
Then he immediately turned to Twitter, which is the better thing to do because it means that suddenly these stories wind up in the public eye.

I found out roughly 10 seconds later, as my notifications blew up with everyone saying, “Hey, have you met Corey?” Yes, Chris and I know each other. We're friends. He wrote the DevOps’ish newsletter for a long time, and the secret cabal of DevOps-y type newsletters runs deep. We secretly run all kinds of things that aren't the billing system for cloud providers.

So, he hits the batphone. I log into his account once we get a credential exchange going, and I start poking around because, yeah, generally speaking, a 100x bill increase isn't typical. And what I found was astonishing. He was effectively only running a static site with S3 in this account, making the contents publicly available, which is normal. This is a stated use case for S3, despite the fact that the console is going to shriek its damn fool head off at you at every opportunity that you have exposed an S3 bucket to the world.

Well, yes, that is one of its purposes. It is designed to stand there, or sit there depending on what a bucket does (lay there, perhaps) and provide a static website to the world. Now, in a two-day span, someone or something downloaded data from this bucket, which is normal, but it was 30 terabytes of data, which is not. At approximately nine cents a gigabyte, this adds up to something rather substantial. Specifically, after free tier limits are exhausted, that's right: $2700.

Now, the typical responses to what people should do to avoid bill shocks like this don't actually work. “Well, he should have set up a billing alarm.” Yeah, aspirationally the AWS billing system runs on an eight-hour eventual consistency model, which means that at the time the bill starts spiking, he has at least 8 hours, and in some cases as many as 24 to 48, before those billing alarms would detect it.
The entire problem took less time than that.

So, at that point, it would be alerting after something had already happened. “Oh, he shouldn't have had the bucket available to the outside world.” Well, as it turns out, he was fronting this bucket with Cloudflare. But what he hadn't done is restrict bucket access to Cloudflare’s endpoints, and for good reason. There's no way to say, “Oh, Cloudflare’s identity is going to be defined in an IAM managed policy.” He has to explicitly list out all of Cloudflare’s IP ranges, and hope and trust that those IP ranges will never change despite whatever networking enhancements Cloudflare makes. It's a game of guess and check and having to build an automated system around this. Again, all he wanted to do was share a static website. I've done this myself. I continue to do this myself and it costs me, on a busy month, pennies. In some rare cases, dozens of pennies.

Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.

So, we start looking at this and it becomes abundantly clear that there is a serious problem here.
It all came out of that bucket; there was nothing else being used in this environment that would have caused this. The first 10 terabytes were charged at nine cents a gigabyte, then down to eight and a half cents per gigabyte for the next 40 terabytes, which it never exceeded, and that's where it was left: at roughly 30 and a half terabytes. Now, what caused this? Nobody knows, because logging wasn't enabled on this bucket and you can't do that retroactively, so we have no idea what objects were involved, or which requester IPs were involved.

So, anything you tell Chris that he should have done is telling him this after it's too late to do anything. Again, he is not some random fool who fails to understand how computers work. He's a Cloud Native Computing Foundation Ambassador, he works at IBM Hat, and he is very well known in this space. A coulda, woulda, shoulda response here does not work because it comes down to the fact that, yeah, this is actual personal money here. He tweeted a thank you to his wife for not effectively having a mild heart attack in response to this.

Now, at the time of this recording, I have it on good authority that AWS is going to make this right in his account. But he's depending upon the largesse of a company to look at this and say, “Oh, that's okay.” There's no guarantee that that's going to happen. That's a concern that he should not have to deal with here. There's a world of difference between a person on a personal account playing around with AWS, and what an enterprise would expect to happen.

I've been agitating for a while for a better approach to the free tier, and this highlights exactly why that is. I want a model where rather than charging me through the nose, you stop serving traffic.
A bunch of toy experiments I have in place would benefit greatly from this particular model because I don't want to wind up having to cut a mortgage check to AWS, I just want it to stop serving my ridiculous thing that accidentally got stuck in a loop or got pushed to the wrong place.

Again, I want to be explicitly clear here. Setting up this static site, Chris did nothing wrong by any measure. This was not a misconfigured S3 bucket. This was not passing data back and forth 50 times because of bad architecture. He is using the system as it was intended to be used, and as it was designed to be operated, and then a surprise bill hit him out of nowhere.

Now, what's fascinating to me about this is things like this happen to my clients all the time, and they really don't care all that much. When you're spending a couple million dollars a month, you don't really care about a $3,000 bill surprise. In fact, it's hard to find it in the noise of everything else going on in the account. So, things like this aren't on larger companies’ radars as a risk.

There are ways to wind up catching this from the outside. Everything I can think of though is a little nutty. You could enable advanced monitoring on the S3 buckets. You can see the bytes downloaded on a per bucket basis and then have deviation monitoring. But that's going to be super noisy because it's going to go off whenever you use it and it's significantly above where it was before. It's an awful lot of problem here, and there still is no good solution other than looking at Chris and clucking at him, telling him he did it wrong.

So, when I went through and did this analysis, last month's bill was just shy of $2700. This month's bill so far (and I'm recording this on July 6th) was 28 cents, but in a final twist, just to make sure it hurts you a little bit more, AWS now predicts the bill for the month of July in his account to be just over $1100.
This is one of those customer profiles that everyone starts out at, and it's a profile that gets left behind by the current painful process and horrifying nightmare that is the AWS billing system.

I'm Cloud Economist Corey Quinn. This is the AWS Morning Brief: Whiteboard Confessional series. And if you've enjoyed this podcast, please leave a five-star review on Apple Podcasts, whereas if you’ve hated this podcast please leave a five-star review on Apple Podcasts and leave an S3 bucket open for me.

Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.

Announcer: This has been a HumblePod production. Stay humble.
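
The tiered data transfer arithmetic described in the episode (nine cents a gigabyte for the first 10 terabytes, eight and a half cents for the next 40) can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not the AWS billing calculation: it treats a terabyte as 1024 gigabytes, ignores the free tier, and uses the episode's "roughly 30 and a half terabytes" as the input. The function and constant names are made up for illustration.

```python
# Rough sketch of tiered data-transfer pricing, using the rates quoted in the episode.
# Assumptions (not from AWS documentation): 1 TB = 1024 GB, free tier ignored.
GB_PER_TB = 1024

# (tier size in GB, price per GB in USD), in the order they are consumed.
TIERS = [
    (10 * GB_PER_TB, 0.09),   # first 10 TB
    (40 * GB_PER_TB, 0.085),  # next 40 TB
]


def transfer_cost(gigabytes, tiers=TIERS):
    """Return the estimated charge for `gigabytes` of outbound transfer."""
    remaining = gigabytes
    total = 0.0
    for size, price in tiers:
        if remaining <= 0:
            break
        billed = min(remaining, size)  # portion of usage that falls in this tier
        total += billed * price
        remaining -= billed
    if remaining > 0:
        # The episode only quotes two tiers, so anything beyond them is unknown here.
        raise ValueError("usage exceeds the tiers quoted in the episode")
    return total


if __name__ == "__main__":
    estimate = transfer_cost(30.5 * GB_PER_TB)
    print(f"Estimated transfer charge: ${estimate:,.2f}")
```

With those assumptions the estimate lands in the neighborhood of the $2700 figure from the episode; the exact bill depends on the precise byte count and on how the free tier is applied.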

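The Cloudflare workaround mentioned in the episode, explicitly listing IP ranges in the bucket policy, might look something like the sketch below. It only builds the policy document; it does not call AWS. The bucket name and CIDR blocks are placeholders (the ranges are reserved documentation addresses, not Cloudflare's), and the helper name is hypothetical. A real setup would need the currently published ranges and some way of keeping them current, which is exactly the maintenance burden described above.

```python
import ipaddress
import json


def ip_restricted_read_policy(bucket, cidrs):
    """Build an S3 bucket policy allowing GetObject only from the listed CIDR blocks."""
    if not cidrs:
        raise ValueError("at least one CIDR block is required")
    # ip_network raises ValueError on malformed input, so typos fail early.
    validated = sorted(str(ipaddress.ip_network(c)) for c in cidrs)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadFromListedRanges",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # aws:SourceIp is matched against the requester's IP address.
                "Condition": {"IpAddress": {"aws:SourceIp": validated}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values: hypothetical bucket, documentation-only IP ranges.
    policy = ip_restricted_read_policy(
        "example-static-site",
        ["198.51.100.0/24", "192.0.2.0/24"],
    )
    print(json.dumps(policy, indent=2))
```

Because the allowed ranges are baked into the policy, any change on the CDN side that is not mirrored here silently breaks the site, so this trades one failure mode for another.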