AWS Cost Anomaly Detection 2: Electric Boogaloo
About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript

Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. It occurs well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is makes it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.

Pete: Hello, and welcome again to the AWS Morning Brief: Whiteboard Confessional. Corey is still enjoying some wonderful family time with his new addition, so you're still stuck with me, Pete Cheslock. But I am not alone. I have been joined yet again, with my colleague, Jesse DeRose. Welcome back, Jesse.

Jesse: Thank you for having me.
I will continue to be here until Corey kicks me back off the podcast whenever he returns and figures out that I've locked him out of his office.

Pete: We'll just change all the passwords and that'll just solve the problem.

Jesse: Perfect.

Pete: What we're talking about today is the “AWS Cost Anomaly Detection, Part Two: Electric Boogaloo.”

Jesse: Ohh, Electric Boogaloo. I like that. Remind me what that's from. I feel like I've heard that before.

Pete: Okay, so I actually went to go look it up because all I remembered was that there was, like, a movie from the past, “Something Two: Electric Boogaloo,” and I dove to the internet—also known as Wikipedia—and I found it: it was a movie called Breakin’ 2: Electric Boogaloo, which is a 1984 film. And it says it's a sequel to the 1984 breakdancing film Breakin’: Electric Boogaloo, which I thought was kind of interesting because I always thought of that joke ‘Electric Boogaloo’ as related to the part two of something, but it turns out it's not. It actually can be used for both part one and part two.

Jesse: I feel like I'm a little disappointed, but now I also have a breakdancing movie from the ’80s to go watch after this podcast.

Pete: Absolutely. If this does not get added to your Netflix list, I just—I don't even want to know you anymore.

Jesse: [laughs].

Pete: What's interesting, though, is that there was a sequel called Rappin’, which says, “Also known as Breakdance 3: Electric Boogalee.”

Jesse: Okay, now I just feel like they're grasping at straws.

Pete: I wonder if that was also a 1984 film. Like, if all of these came out in the same year. I haven't looked that deep yet.

Jesse: I feel like that's a marketing ploy, that somebody literally just sat down and wrote all of these together at once, and then started making the films after the fact.

Pete: Exactly.
One last point here, because it's too good not to mention, was that it basically says that all these movies, or at least the later one, had an unconnected plot and different lead characters; only Ice-T featured in all three films, which then got me to think a sec—wait a second, Ice-T was in this movie? Why have I not watched this movie?

Jesse: Yeah. This sounds like an immediate cult classic. I need to go watch this immediately after this podcast; you need to go watch this.

Pete: Exactly. So, anyway, that's the short diversion from our, “AWS Cost Anomaly Detection, Part Two” discussion. So, what did we do last time? Why is this a part two? Hopefully, you have listened to our part one. It was, I thought, quite amazing—but I'm a little bit biased on that one—where we talked about a new service that was very recently announced at Amazon called AWS Cost Anomaly Detection. And this is a free—free service, which is pretty rare in the Amazon ecosystem—that can help you identify anomalies in your spend. So, we got a bit of a preview from some of the Amazon account product owners for this Cost Anomaly Detection, and then we got a chance to just dive into it when it turned on a few weeks ago. And it was pretty basic. It's a basic beta service—they actually list it as beta—and the idea behind this is that it will let you know when you have anomalies in your cost data, primarily increases in your cost data. I remember specifically talking that it was specifically hard to identify decreases in spend as an anomaly. So, right now it only supports increases. So, a few weeks ago, we went into our Duckbill production accounts, turned it on, and we were just waiting for anomalies so that we could do this.

Jesse: I also think it's worth noting that I'm actually kind of okay with it being basic for now because if you look at almost any AWS service that exists right now, I would say none of them are basic.
So, this is a good place to start and gives AWS opportunities to make it better from here without making it convoluted or difficult to set up in the first place.

Pete: A basic Amazon service, much like myself.

Jesse: [laughs].

Pete: So, guess what? We found anomalies. Well, we didn't find them. The ML backing Cost Anomaly Detection found some anomalies. So, that's what we're here to talk about because now that we actually have some real data, and real things happened, and we actually dove into some of those anomalies, interestingly enough. So, that's what we're here to talk about today.

Jesse: It's also probably worth noting that we changed our setup a few times over the course of kicking the tires on this service, and unfortunately, we weren't able to thoroughly test all of the different features that we wanted to test before this recording. So, we do still have some follow up items that we'll talk about at the end of this session. But we did get a chance to look at the majority of options and features of this service, and we'll talk about those today.

Pete: So, if you remember—or maybe you don't because you didn't listen to the last episode we did—we configured a monitor, is what it's called, that will analyze your account based on a few different criteria. And the main one is, I think it will just look at the different AWS services across your AWS service monitor. And you can have it go look at specific accounts, look at specific cost allocation tags, there's a whole type of setup that you can do for these alerts. And the only real configuration choice that you have to make is an alert threshold. And this was something that took us a little bit to kind of understand, and I think we both really understand it a lot better now. And we made a change, right?
Like, what we thought it was, wasn't totally what it was.

Jesse: Yeah, initially, this was a little bit confusing for me, and it took us a while to wrap—it at least took me a while to wrap my brain around the difference between the anomaly itself and the alert threshold. Effectively, the anomaly can be any dollar amount: it can be any amount of spend over the basic amount of spend that's expected for that particular service or that particular monitor that you've enabled, but the alert threshold itself is just the threshold at which we want to be alerted if the anomaly itself is over that threshold. So, in our case, when we first enabled the service, we set the alert threshold at $10. But all the anomalies that we saw were much lower than that. They were all about $1 apiece. So, we never got alerted to those anomalies, even though we did log into the console and see those anomalies.

Pete: Yeah. I think that's really the key thing is the alert threshold is truly that. It is: when you get an anomaly, at what spend level it identified do you want to receive an alert? And the alerts that it can generate are, kind of, real-time where you can have it notify to an SNS endpoint. Our configuration had an SNS topic go to AWS Chatbot which will drop a message into our Slack. We reduced that, as Jesse said, down to $1 because I still kind of want to see what it looks like when it shows up in Slack. So, hopefully, we'll see that in a few weeks, or at, just, some point in the future. But then you can also have it just send, I think, daily—you can send these summaries daily or even weekly. I'm not sure if there was a monthly option. Maybe, Jesse, you remember that one. But—

Jesse: Yeah, it looked like there was just daily and weekly for now.

Pete: So, this came back to one of our original gripes, which was it didn't seem like you could create multiple anomaly alerts.
So, for example, I might want to have the real-time stuff going into Slack just by default, but then maybe there's someone in my finance department, or my VP of engineering who's not in that Slack channel, doesn't really want the noise. They want to get the weekly one. It didn't look like there was a way to do that, and I think that's where we ran into an issue last time. We specifically got an alert—or an error for setting that up.

Jesse: Yeah, and it's also worth noting that the error itself was rather vague. It said that we couldn't enable this alert but didn't tell us why, or how, or what part of the walkthrough was erroring out. And I can see that this would be really beneficial to allow individual contributors to see their spend alerts repeatedly in Slack, whereas somebody that's much higher up doesn't need that level of noise. So, they just want the report at the end of the day or the end of the week to know what's going on with their teams.

Corey: This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. If you're looking to either process multiple terabytes in a petabyte-scale of data a day or a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to say, “Yeah, 70 to 80 percent on the infrastructure costs,” which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves.
You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.

Pete: There is definitely a way, hopefully in the future when they enable this but, for some organizations, if I can set up these anomalies with various cost allocation tags and I have different product teams, or product owners, or business units or whatever, and I can notify these teams in some sort of reasonable way, giving them maybe a heads up, like, “Hey, here's some anomalies,” that could be super powerful. So, again, right now, it doesn't look like it, but again, it's beta and also, it's free. So, can we really—I mean, yes, we can, of course, complain that much about it. But still. [laughs].

Jesse: Similar to what we said last week, I am thrilled that this service exists, even if there are things that we want from it. All of the questions that we have, all the content that we have from the previous session and this session, these are all things that are wishlist items; these are all things of, ways that this service can improve, but in no way is critiques of the existing service itself. There's definitely lots of room for improvement.

Pete: Yeah. If you would like to hear more of our critiques of a service, just stay tuned for our future QuickSight product deep dive because—

Jesse: Oh God. Don't even get me started. My eyes are already twitching.

Pete: [laughs]. Don't worry, QuickSight team, we still love you. All right, so we found an anomaly, and the anomaly pointed us towards a root cause which, that is a bit of a charged word now in the technical ops communities, right?

Jesse: Yeah, there's a lot of pushback against the phrase ‘root cause’ in the industry because in most cases—I’m going to butcher this really poorly, and there's many other sources that talk about this more clearly, but in most cases, there's not a single root cause. There's multiple contributing factors to any event.
So, using this phrase is kind of frustrating for me, and I really wish that they had talked about these potential causes for the anomalous spend as such, as ‘potential causes’ rather than a ‘root cause.’

Pete: And this feels like—I am not an English major, and I have not studied this area in-depth, but kind of it feels like the term ‘contributing factor’ would work here, that you could call these anomalies a contributing factor to this cost anomaly.

Jesse: But again, it's also worth pointing out that we appreciate that this service highlights what it expects is the potential contributing factors. So, rather than just saying, “Hey, your spend for a particular service or particular monitor is up this much money.” It's actually pointing you to, potentially, what is causing that. So, there's definitely good coming out of this feature, but we just wish that it was renamed, which should be a simple request.

Pete: We say that but, of course, having no idea what it takes to rename—[laughs]. But as you said, and so it identifies the anomaly in some way. And for us, we actually saw an anomaly related to S3. And this was actually related to a lot of other anomalies we found that were centered around our Athena and QuickSight usage, which is why our QuickSight wounds are still a little fresh. The S3 anomaly it identified for us a couple of weeks ago, and when you go and dive into this anomaly, it will provide you a link that will take you directly to Cost Explorer. And I really like this feature, by taking you to Cost Explorer, so that you can see everything broken out, every filter included in. It even includes something called ‘usage type,’ which if you're a Cost Explorer newbie, ‘usage type’ is, kind of, the billing code or the specific usage. So, in our scenario, it was a requests tier two usage type, that is, a class of requests for S3 that include gets, and puts, and things of that nature.
And so we saw this much higher number of those tier two requests to S3, that could help us identify a little bit more about what was causing this.

Jesse: And it's also worth noting that when you receive any anomaly detection, the service also gives you the opportunity to train the machine learning model: there's a little button up at the top that says, “Submit assessment.” And it says, basically, “Hey, did you find this detected anomaly to be helpful?” And you can say, “Yes, it was an accurate anomaly,” or, “No, it was a false positive.” Or, “Yes, it was an anomaly, but we expected it.” Which will ultimately help the model better understand what your spend looks like over time, where to expect anomalies, and how better to alert you, as an AWS customer, about your spend in the future.

Pete: Honestly, I think this was a missed opportunity by the Cost Anomaly Detection team to not create an Amazon version of Clippy, where they could just have something pop up that's like, “I see you're trying to report this assessment of this anomaly.” I don't know what the character would be; I'm not very creative, but I think this was definitely a missed opportunity. So, I think they should grab some of those creative minds at Amazon and toss them at this problem.

Jesse: [laughs]. I would love to see an AWS Clippy. I'm going to start a hashtag on Twitter for AWS Clippy.

Pete: We'll get to work on that. I think another thing that we found that is definitely an area for improvement, especially if you have a lot of Amazon accounts, is when it does report the anomaly and it takes you in to where you can, kind of, view it in Cost Explorer, or report it, kind of, one level down, it'll list the region, the service, the account, but it lists the account by account number—

Jesse: Yeah.

Pete: —when we've got—what, like, I think we have a handful of accounts, maybe less than 10, but how does that work when you've got hundreds or more?

Jesse: Yeah, I feel like this is a missed opportunity.
I don't expect anybody to remember their linked account numbers for a single account, let alone for multiple or hundreds of accounts. And I really wish that there was a way that the service could tie the account number directly to the name on the account, or maybe the meta-name on the account, depending on how the account is set up. Something that gives the user who's looking at this information a little bit clearer data to dive into, and a little bit clearer opportunity to know, “Okay, I'm looking at this particular service in this particular account.” And rather than saying, “Oh, the account number is XYZ,” it is the production account, or it is the development account, or it is our security accounts really clearly off the bat.

Pete: Look, I'm not a network administrator in the early aughts. I don't have numbers memorized, like IP addresses and such. So—

Jesse: Oh, don't even get me started.

Pete: I still have some of those IP addresses memorized. It's sad. I mean, meanwhile, I can't remember my kids’ birthdays, but yeah, the IP address from 2002 DNS server that I used to use, still burned in there, right? So, yeah, I think it was helpful, friendly names that could exist. And these connections exist, right? Your billing console will have that context. If your bill just gave you a bunch of account IDs and your total spend, I think many companies would have a hard time figuring out which one was the account ID and which one was the amount of spend because the numbers are so big in both cases. But adding a little bit more context, I think would be helpful.

Jesse: I'm also really curious what sets the severity level for each anomaly. So, when you log into the anomaly dashboard, you see the table of recent anomalies. And each row—each anomaly—has a severity level associated from low to medium to high. I'm really curious what sets that severity level. We couldn't find this going through the existing documentation or in the walkthrough wizard.
I'm not sure if it's something that I'm missing in the documentation, or if it isn't clearly documented yet.

Pete: Yeah, it’s a good point. We can make some guesses. Are they higher spend numbers than lower? Are they numbers and spends that are closer to our alert threshold? It's really hard to say for us but definitely would love a little bit more insight in the documentation there. The other thing, too, that I think we—kind of waiting for more anomalies is—we don't really have a good answer for is how quickly do the alerts show up when this anomalous spend happens? As everyone probably knows, your spend within Amazon is laggy. It could be measured in hours—or even days—for some charges to post through. So, how long does it take for this anomalous spend to occur before the ML picks it up? That's something where I think just as more people use it, as we use it some more, we can kind of see some examples. And hopefully, with our threshold being set so low for alerting, when we see a message dropped into our Slack channel, we can go analyze it right away. I think it'll be pretty cool to see how quickly that happens. If for some reason we are messing around with Athena, and S3, and QuickSight again, are we creating new anomalies? And how quickly from the time that I might be messing around in those services to the alert being posted? If that's measured in hours, like that could be pretty interesting.

Jesse: And this goes back to one of my previous comments, which is we still have lots of digging to do on this service. We've got multiple products leveraging AWS, so we definitely want to enable a monitor for each of our product tags so we can get a clearer idea of spend by product.
And we want to dig into some of these anomalies that we already have seen; we want to dig into them further, we want to better understand where they're coming from and what's causing them, and we want to make sure that the Slack integration, or the SNS integration, is working as expected, that we can receive these alerts clearly and effectively, and just really continue testing all of what we're seeing so far.

Pete: So, Jesse, are you saying that there might be an “AWS Cost Anomaly Detection, Part Three: Electric Boogalee?”

Jesse: Oh, my God, don't even get me started. I'm sorry, folks. I'm just going to put in my resignation now.

Pete: All right. Well, thanks again, Jesse, for taking us through AWS Cost Anomaly Detection, and all the fun stuff we found. Really appreciate that.

If you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us, how many anomalies did you find? Thanks again. Bye-bye.

Announcer: This has been a HumblePod production. Stay humble.
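
The difference between an anomaly and the alert threshold, as Jesse and Pete describe it, can be sketched in a few lines of Python. This is an illustration only, not how the service is implemented: the `Anomaly` class, the `anomalies_to_alert` function, and the dollar figures are all made up for the example, and treating an anomaly that exactly equals the threshold as alert-worthy is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    """Hypothetical record of one detected anomaly (not an AWS type)."""
    service: str
    impact_usd: float  # spend above what was expected for the service


def anomalies_to_alert(anomalies, threshold_usd):
    """Return only the anomalies whose impact meets the alert threshold.

    Assumption: an impact equal to the threshold also triggers an alert.
    """
    return [a for a in anomalies if a.impact_usd >= threshold_usd]


# Made-up anomalies of roughly $1 each, as in the episode.
detected = [
    Anomaly("Amazon S3", 1.20),
    Anomaly("Amazon Athena", 1.05),
    Anomaly("Amazon QuickSight", 1.10),
]

# With a $10 threshold, every anomaly is still detected and would be
# visible in the console, but none of them generates an alert.
alerts_at_10 = anomalies_to_alert(detected, 10.0)

# Lowering the threshold to $1 makes all three alert-worthy.
alerts_at_1 = anomalies_to_alert(detected, 1.0)

print(len(detected), len(alerts_at_10), len(alerts_at_1))  # prints: 3 0 3
```

Detection and alerting are separate steps here: the list of detected anomalies never changes, only the subset that crosses the threshold does, which matches the hosts seeing anomalies in the console without ever getting a Slack message.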