The Unconventional Guide to Cost Management: Architectural Context
Check out the full unconventional guide here!

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I'm going to just guess that it's awful, because it's always awful. No one loves their deployment process. What if launching new features didn't require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren't what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Pete: Hello, and welcome to AWS Morning Brief. I am Pete Cheslock.

Jesse: I'm Jesse DeRose.

Pete: This is Fridays From the Field. Triple F.

Jesse: I feel like we've really got to go full Jean-Ralphio, Parks and Rec there. "Friday From the Feeeeeeeeeeild."

Pete: Yeah, so we're going to need to get an audio cut of that and add some techno beats to it. I think that's going to be our new intro song.

Jesse: [imitates techno beats].

Pete: Yeah, we're going to take both of those things. I'm glad we got this recorded, because that's going to turn into a fantastic song. So, we're back to talk about The Unconventional Guide to Cost Management. This is the first of a whole slew of episodes we're going to be going through from the field, covering the different ways that companies can impact their spend. And no, it doesn't mean go buy the cloud management vendor of the moment to look at your spend, or fire up Cost Explorer.
Those are all pieces of it, but we mean the broader things: the big levers, the small levers, the levers that don't actually go back and forth but that you turn, and you'd have no idea, because they were designed by an Amazon UX engineer.

Jesse: Yeah, it's really important to call out that this discussion looks at your cloud spend from a broader perspective. If you didn't get a chance to listen to our episode from last week, we did a bit of an intro framing this entire discussion, really talking about why looking at cloud costs through these different lenses is important. Go back and take a listen if you haven't yet. Why are you thinking about cloud cost not just from the perspective of, "Oh, I'm going to delete these EBS snapshots," or, "I'm going to tag all my resources," but from these other angles?

Pete: Exactly. So, don't forget, you can go to lastweekinaws.com/QA and put your questions right in that box. Your name is optional; you can leave it blank if you don't want anyone to know who you are. Or if you want to say something really nice about me and Jesse, and you just feel a little shy—

Jesse: Aww.

Pete: —that's fine, too. But just put a question in there. We're going to dedicate some future episodes to answering those questions and diving a little deeper for those who want to know a little bit more. But this being the first episode, we've got to talk about something, so what are we talking about today, Jesse?

Jesse: Today we are talking about architecture and architectural context. Now, this is a really interesting one for me, because the first thing anybody thinks about when they think about cutting their AWS spend is architecture decisions: something related to your infrastructure, whether that's tearing down a bunch of resources or deleting data that's lying around. But there's a lot more to it than that: context is everything.
Knowing why your infrastructure is built the way it is, and knowing why your application is designed the way it is, is really important to understanding your AWS cloud costs.

Pete: This is where I feel like the Cloudability, CloudHealth, CloudCheckr, Cloud-whatever companies' products sadly fall down. And it's similar for every Amazon recommendation engine inside of AWS; they all break down because they lack the knowledge and the context of your organization. I remember, a really long time ago, I had installed CloudHealth for the first time, and it said, "Hey, we've identified all these servers. They're sitting idle. Do you want us to turn them off for you?" Those servers were actually my very large Elasticsearch cluster. They were idle because if no one's querying them, they don't do anything, but they sure do hold a lot of data, and they really do need to be available. So, please, please don't turn those off. The same thing could happen if, due to risk or compliance reasons, you had to run some infrastructure as a warm standby in another availability zone or region. Yeah, sure, it's not taking requests, it's not doing anything, but that doesn't mean that it's not supposed to be running.

Jesse: And this is really getting at one of the first big ideas, which is: work with other teams within the company. Not just other engineering teams, but product teams, and possibly also security teams, to understand all of the business context for your application and your infrastructure in terms of data retention, availability, and durability requirements.
Because ultimately, you as a platform engineer, or an SRE, or a DevOps engineer, or whatever the hot new title is going to be a year from now, need to understand why the infrastructure exists. You may see servers that are sitting around idly doing nothing, but that's your disaster recovery site, required by the business, by a service level agreement, to be available at a moment's notice if something goes wrong. So it's really important to understand what those components are and how they work together to build your overall application infrastructure.

Pete: Yeah, that's a great point. If you've been at a company for years, you've got a lot of this historical knowledge. But people have come and gone; they've done things, they've implemented items, they've brought new features, and they've left. As companies grow, there may not be a single person who truly understands the impact of various changes. I think we saw that most clearly when Amazon had their Kinesis outage: the number of different services that were impacted was pretty large, because it's just all too big for any one person to understand. But that doesn't mean you shouldn't continually be working to understand those different usage requirements and chatting with the non-tech teams. Product teams, I feel, are often ignored in startups because you don't really want more work, and that's what those product teams normally generate, right? But they're going to have a lot of context. I remember working at SaaS companies and looking at things like, "This? We don't use this anymore. There's no way we use this. I'm going to turn this off." And then smarter minds prevailed, and I said, "Well, let me go talk to the product people." And they'd go, "Oh yeah. We can't get rid of that one super important API, because this one client of ours paid us an obscene amount of money to make sure that we always support it." It's like, wow, dodged a bullet on that one, right?

Jesse: Yeah. And I feel like this gets at another important idea: communication, and removing tribal knowledge and information silos. Really make sure this information is communicated to everybody, whether in written documentation or verbal communication—actually, it should probably be both, ideally—so that everybody understands why your architecture is the way it is. Then they have the context to know that the server that's sitting idle is worth keeping around, or that the API that never gets any requests is kind of important and you can't just get rid of it.

Pete: Yeah. Now, in the cloud world, you pay money every time you do every little thing—transfer data, provision servers, I/O costs—so overlay those types of costs onto your architecture diagrams. You know, those architecture diagrams that you were told to create and keep up to date five years ago?

Jesse: We absolutely still have those.

Pete: Yeah, we totally have. They're super accurate. But make those architecture diagrams work more for you by putting more information in them. Don't just draw little lines connecting every server to every other server; that's not helpful to people. Maybe map out the data flows, and the volume of data flowing between different applications and their consumers. And then, for the true next-level cost management experience in your organization—

Jesse: [singing] Ahh.

Pete: —put some dollar amounts on them. If they're close to accurate, that would be neat, too. They don't even need to be exact numbers; they could be percentages you can always go look up later. With that information, you now have an architecture diagram that can be used by a whole slew of people.
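As an aside: the cost-annotated diagram Pete describes here can be generated programmatically. The sketch below emits Graphviz DOT text with a per-edge volume and cost label; the service names, monthly volumes, and the flat $0.02/GB rate are all made-up assumptions for illustration, not real AWS pricing.

```python
# Sketch: emit a Graphviz DOT diagram whose edges carry estimated
# monthly data-transfer costs. Everything here -- service names,
# volumes, and the flat per-GB rate -- is a made-up illustration.

INTER_AZ_PER_GB = 0.02  # assumed blended transfer rate, USD per GB

# (source, destination, GB moved per month) -- hypothetical flows
flows = [
    ("api", "aurora", 500),
    ("api", "kinesis", 1200),
    ("kinesis", "analytics", 1200),
]

def to_dot(flows, rate_per_gb):
    """Render the flow list as DOT, labeling each edge with volume and cost."""
    lines = ["digraph arch {"]
    for src, dst, gb in flows:
        cost = gb * rate_per_gb
        lines.append(f'  {src} -> {dst} [label="{gb} GB/mo ~ ${cost:.2f}"];')
    lines.append("}")
    return "\n".join(lines)

print(to_dot(flows, INTER_AZ_PER_GB))
```

Even a tiny generator like this keeps the diagram and the cost estimates in one versionable place, which is the point Pete is making about diagrams that work for you.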
It can be used by finance, to look at it and say, "What's the most expensive part of our infrastructure?" It can be used by product teams to understand how things interact with each other, and by engineering teams to debug when things go wrong, versus just boxes with lines.

Jesse: Yeah, I think this also gets at a really interesting point. Laying out all of these components of your application architecture and your infrastructure in diagrams also lets you understand where your data is going and how much of it is moving from place to place. We'll talk about this in a later episode, but this gives you the opportunity to better understand the data flow from resource to resource and from microservice to microservice, and to understand how much data you're actually moving around, and why and where that data is moving.

Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you've heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.

Pete: Yeah, to put this into a bit more perspective, we do a lot of cost optimization projects with our clients where we really dive, technically, into the architecture and chat with their engineers to understand how and why they chose certain services.
There was one client of ours who had a lot of spend hidden inside the movement and duplication of data: reloading Aurora RDS instances from nightly data dumps into S3, and pushing that data through multiple Kinesis streams consumed by various downstream applications. And it wasn't the engine of all of this that cost them the most money; it wasn't the fact that they ran a certain size of Aurora, and it wasn't even the data storage costs. It was all of the I/O, all of the data movement, and the network charges that were really starting to add up. They didn't have any data flow diagrams showing data moving between these different places. If they did, they would have clearly seen all of this duplication and would have been able to resolve it in a more timely fashion. But they're no different from a lot of other clients out there: they don't have a central place to put this information. In their case, they could move toward more of a data lake model instead of continually dumping and reloading databases; and even in a Kinesis world, you can have fan-out streams so you don't need multiple duplicate streams copying data from one place to another. Those data flows can expose a lot of spend that might otherwise be hidden. I always like to talk about how moving data is so expensive—we actually dedicate a whole episode just to that concept—but that's how you identify why those spend items are so large in your organization: that why; that context.

Jesse: Absolutely. And that is why context is so critical when building your application architecture diagrams and your infrastructure diagrams.
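A quick back-of-the-envelope sketch of the duplication pattern described above: if the same data is ingested into several identical Kinesis streams, ingestion and shard costs scale with the number of copies, while a single stream read by multiple consumers pays for ingestion once. The prices below are illustrative placeholders, not current Kinesis rates, and the extra charges for enhanced fan-out consumers are deliberately not modeled.

```python
# Back-of-the-envelope: three duplicate streams ingesting the same data
# versus one stream that downstream consumers read via fan-out. Prices
# are illustrative placeholders, not current Kinesis rates.

HOURS_PER_MONTH = 730
SHARD_HOUR_USD = 0.015         # assumed price per shard-hour
PUT_UNITS_PER_MILLION = 0.014  # assumed price per million PUT payload units

def monthly_stream_cost(streams, shards_per_stream, put_units_millions):
    """Ingestion + shard-hour cost when the same data goes into each stream."""
    ingestion = streams * put_units_millions * PUT_UNITS_PER_MILLION
    shards = streams * shards_per_stream * SHARD_HOUR_USD * HOURS_PER_MONTH
    return ingestion + shards

# Hypothetical workload: 2 shards, 500 million PUT units per month.
duplicated = monthly_stream_cost(streams=3, shards_per_stream=2, put_units_millions=500)
fanned_out = monthly_stream_cost(streams=1, shards_per_stream=2, put_units_millions=500)
print(f"duplicated: ${duplicated:.2f}/mo, single stream with fan-out: ${fanned_out:.2f}/mo")
```

The exact numbers matter less than the shape: the duplicated pattern pays every per-stream charge three times over, which is exactly the kind of spend a data flow diagram makes visible.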
You really want to make sure that you understand how your portion of the application works, but also how all the other portions work, so that ultimately everybody can be on the same page, help each other accomplish those goals, and really help the business move forward together.

Pete: Yeah. And it doesn't even stop at just accepting what you find. Ask these questions: do we really need this cluster to be replicated across four or five availability zones? Would our risk tolerance allow for three? Would it allow for two? I saved a bunch of money in the past this way. I used to replicate an Elasticsearch cluster across three AZs; that was just our standard—everything we deployed went across three AZs. And then we started looking at Elasticsearch and going, "Well, we're only running a primary and a secondary replica shard. There are only two copies of the data, so why not have one in one AZ and one in another? What's the point of the three AZs?" We were able to cut out a large amount of spend by better understanding and asking those questions. Ask why; get the context. Because that's the thing, too: there were a lot of decisions—especially at the high-growth startups I've been at—that were a great idea four years ago and critical to the company surviving long enough to even ask these questions now. So, just because someone wrote it down four years ago, and that's the way it was done then, doesn't mean you shouldn't reassess it. I think, also, when it comes to multi-region or multi-datacenter strategies, sadly, a lot of folks haven't really caught up on cloud yet, and so they might say, "Yeah, you have to be multi-region." It's like, well, what's the actual requirement? Does it require separate network endpoints and separate physical availability zones?
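The three-AZ-to-two change Pete describes can be sanity-checked with simple arithmetic: with one primary and one replica there are only two copies of the data, so a third AZ's nodes add cost without adding a copy. A hypothetical sketch, assuming one data node per AZ at a made-up hourly rate:

```python
# Sanity-check arithmetic for the AZ reduction described above. The
# node count and hourly rate are hypothetical, not real pricing.

HOURS_PER_MONTH = 730
NODE_HOURLY_USD = 0.20  # made-up per-node price

def monthly_node_cost(nodes, hourly=NODE_HOURLY_USD):
    """Compute monthly compute cost for a fixed number of always-on nodes."""
    return nodes * hourly * HOURS_PER_MONTH

three_az = monthly_node_cost(nodes=3)  # one data node per AZ, standard layout
two_az = monthly_node_cost(nodes=2)    # one node per copy of the data

# In Elasticsearch itself, replica count is a live index setting, e.g.:
#   PUT /my-index/_settings  {"index": {"number_of_replicas": 1}}
print(f"3 AZs: ${three_az:.2f}/mo, 2 AZs: ${two_az:.2f}/mo")
```

The point is not the specific dollar figure but the habit: turn "that's just our standard" into an equation you can question.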
You know, availability zones are separate disaster domains. I mean, they may share regional power and network and things like that, but in a lot of cases the stated requirement far exceeds these legacy multi-datacenter worlds that used to exist. So, asking those questions and getting that context helps you drill down to the crux of the spend. And then you can really impact it for the business and move things, hopefully, in a better direction. Well, if you have enjoyed this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating, and then right after, go to lastweekinaws.com/QA and give us a question. We look forward to answering that in a future episode. Thanks.

Announcer: This has been a HumblePod production. Stay humble.