Kubernetes is the Most Expensive Way to Run a Service
Transcript
Corey: Software powers the world. LaunchDarkly is a feature management platform that empowers all teams to safely deliver and control software through feature flags. By separating code deployments from feature releases at scale, LaunchDarkly enables you to innovate faster, increase developer happiness, and drive DevOps transformation. To stay competitive, teams must adopt modern software engineering practices. LaunchDarkly enables teams to modernize faster. Intuit, GoPro, IBM, Atlassian, and thousands of other organizations rely on LaunchDarkly to pursue modern development and continuously deliver value. Visit us at launchdarkly.com to learn more.
Pete: Hello, and welcome to the AWS Morning Brief. I’m Pete Cheslock.
Jesse: I'm Jesse DeRose.
Pete: And we're back yet again. We're well into 2021. I mean, about a week or so, right?
Jesse: I'm excited. I'm just glad that when midnight struck, I didn't roll back over into January 1st of 2020.
Pete: Yeah, luckily, it's not a Y2K scenario. I don't think we have to deal with the whole date issues until, what, 2032 I think, whatever the next big Y2K-ish date issue is going to be. I'm hopefully retired by the time that that happens.
Jesse: That's future us problem.
Pete: Yeah. Future us problem, absolutely. Well, we've made it. We've made it to 2021, which is a statement no one thought they were going to say last year at this point.
Jesse: [laugh].
Pete: But here we are. And today, we're talking about an interesting topic that may bring us some hate mail. I don't know. You tell me, folks that are listening. But we're seeing this more and more in our capacity as cloud economists working with clients here at The Duckbill Group, that folks who are running Kubernetes—whether it's EKS, or they're running it on EC2 using maybe, like, an OpenShift—are actually spending more than people who are using other primitives within AWS.
So, we wanted to chat a little bit about why we think that is, and some of the challenges that we're seeing out there. And we would love to hear from you on this one. If you are using Kubernetes in any of the ways that we're going to talk about, you can actually send us a story about how you're doing that and maybe answer some of these questions we have, or explain how you're using it. If you go to lastweekinaws.com/QA to ask us questions—not quality assurance—but go to QA for asking us questions. You can put in your information, you can add your name, it's optional if you want. You can be completely anonymous and just tell us how much you enjoy our wonderful tones and talking about technology. So, Kubernetes. Why is this the thing, Jesse?
Jesse: I feel like when it first came out, it was the hot thing. Like, everybody wanted Kubernetes, everybody wanted to be Kubernetes, there were classes on Kubernetes, there were books on—like, I feel like that's still happening. I think it has amazing potential in a lot of ways, but I also feel like… in the same way that you might read the Google SRE book and then immediately turn to your startup team of three people and say, “We're going to do everything the way that Google does it,” this isn't always the right option.
Pete: Feel like the Google SRE book is, like, The Mythical Man-Month, which is the book that everyone wants to quote, the name of the book, but none of those people have ever actually read the book.
Jesse: Yeah, there's lots of really great ideas, but just because they're great ideas that worked well for a large company at scale doesn't necessarily mean that they're going to be the same right ideas for your company.
Pete: And also, we're both fairly grizzled former system administrators and operators; Kubernetes is not the first, kind of, swing of the bat at this problem. I mean, we've had Mesos which, it's still around but not as hip and cool; we've had OpenStack.
Does—remember when all the Kubernetes people were all like, “Nope, OpenStack is going to be the greatest thing ever.” So, needless to say, we are a little jaded on the topic.
Jesse: You can't forget about Nomad, either, from HashiCorp, built cleanly into HashiCorp’s Hashi stack with all of their other amazing development and deployment tools.
Pete: Yeah. I mean, this is a problem that people want to solve. But in the rise of Cloud, on Amazon I always struggled with why it was needed. And we're going to talk a little bit about that. So, again, what is Kubernetes? I hope people are listening that would know this, but maybe not. It's an abstraction layer for scheduling workloads. It's the solution to the Docker problem. Like, a container is great. I have a container, it is a totally self-contained application, ready to go, my configuration, my dependencies. And now I need a place to run it. Well, where do I run this container? Well, pre-Kubernetes, Jesse, you'd probably use something like ECS—the Elastic Container Service—might be a way that you could schedule some workloads.
Jesse: Or maybe if you just wanted to run a single virtual machine somewhere and run that container in the virtual machine, you might do that as well.
Pete: Yeah, that was how a lot of the earliest users of Docker were just running Docker: they were just running the containers as applications—because that's what they are—on their bare EC2. They would just run some EC2 and run a Docker container on there. And there were benefits to that. You got this isolated package deployed out there not having to worry about dependencies. You didn't have to worry about having the right Python dependencies or Ruby dependencies. It came with everything it needed, and that was a big solution. Now Kubernetes, I think, brings this really interesting concept that I like. It's this API that theoretically you could use in a lot of different places.
If you now have this API to deploy your application anywhere there's a Kubernetes cluster, does this solve vendor lock-in? Could you use Kubernetes to solve some of these issues that we see?
Jesse: You could use Kubernetes to solve vendor lock-in in the same way that you could use multi-cloud to solve vendor lock-in. Again, it is a solution to the problem, but is it the right solution for your company?
Pete: That is always the question I feel like I would ask folks when they were using Kubernetes is, I would always ask why they were using it. I honestly will say I never got—I don’t want to say never; that's not fair. I rarely would get a good answer. It was often like a little bit of operational FOMO—you know, the fear of missing out on the next hottest thing, which of course, that's never a good way to pick your architecture stack. Now, that being said, at a previous company, we were investigating Kubernetes to solve a problem with our stateless applications—because I in no way trusted it to run anything stateful. None of my databases I wanted on it. But it is a great way to put more control into my developers’ hands on deploying their applications. We ran predominantly C class instances on EC2. And those C class instances were a CPU heavy data processing application, and so it seemed to make a lot of sense to get a lot more efficient bin packing, to let the developers be a little bit in control of how much memory and CPU they're going to allocate to there. But at the end of the day, we never ended up going down that path because, again, for us and our architecture, just continually running the correctly sized EC2 with some of the other abstractions that we made just made the most sense. But if I was in a data center, if I had legacy data center hardware, I think Kubernetes, that’s, like, the dream API for me.
It's been years since I've been in a data center, but having that way of creating an API for my physical data center assets that could be a similar API—we're not going to say the same because we know that's not true—but a similar API in a cloud vendor, like, that can be pretty compelling, right?
Jesse: Yeah, it really does get to that idea of abstracting away the compute layer from the developer so the developers don't have to worry about, “What kind of infrastructure do I need? What kind of resources do I need? Do I need to provision this C class instance on-demand? Do I need to provision an M class instance, on-demand? Do I need to provision this other thing on-demand? And what does all of that look like?” Ultimately, you can give the developers the opportunity to say, I know that I need this much vCPU credits—or units—and this much memory. I don't care where it runs past that. Go. And let them, again, focus on their development cycles more so than the infrastructure component of the application. I think it's a great opportunity.
Pete: Yeah, absolutely. Anytime you can have an engineer just ship their application without having to worry about what's behind it, that's a win. That's why services like Lambda and services like Fargate, you just push the thing to the Cloud and let it run, are really effective ways of unblocking those developers and not having to deal with any sort of, kind of, operational thoughts, [laugh] for lack of a better term. They can just push those applications. But let's dive in a little bit Amazon-specific, right because we were talking about how Kubernetes is the most expensive way to run a service. But I would say: in Amazon. That's the catch here because there are so many ways of running things in Amazon. One thing that we see, often, is—we see it internally into our clients a lot is, the more ephemeral you can make an application, the cheaper it will be to do that.
So, not to put you on the spot Jesse, but what does that really mean when we say that? When we say that ephemerality is kind of a driver of cost?
Jesse: Yeah. Ultimately, we see a lot of clients who treat AWS like another data center. They treat AWS the same way that they would treat a bare-metal physical server sitting in a data center somewhere, which to some extent, arguably, it is, but when you think about development and resource usage, ultimately, AWS provides so much more than a data center in terms of cloud-native resources. And when I'm talking about cloud-native resources, I specifically mean things like ECS, Fargate, Lambda, where developers can quickly deploy and iterate on an application without relying specifically on this massive compute infrastructure under it. So, ultimately, a company is able to deploy things in such a way that code is only around for as long as it needs to be around. This is moving towards the idea of request-based workloads, especially with Lambda, where a piece of code for a workload will be fired whenever a request comes in, and then fire whatever other code it needs to do whatever it needs to do in terms of its workload, and then it's done. That's it, and you only pay for the amount of time that that request took. Whereas, when you are running workloads on an EC2 on-demand instance, you're paying for the entire amount of time that that instance is online and running, whether the application is actually performing any requests or not.
Pete: Yeah. I mean, it seems common knowledge nowadays, but I'm still going to repeat it: with the exception of a T class instance, if your CPU usage is anything below one hundred percent, there is technically waste, right? There is usage you're not using.
Now, obviously, it gets a little more complex: you've got memory, maybe you're using all of the memory because you've got some gnarly Java application and very little CPU, but then, in that case I'd say, yeah, you're still probably using the wrong instance. Move to a T class instance. But again, your mileage may vary. There's a reason there's hundreds of different EC2 instances because there's so many different workloads. But yeah, that goes back to that statement, Jesse, like you said. The more ephemeral your applications, the more they can survive any sort of workload interruption, leveraging spots and things like that, the more inexpensive it is. The Kubernetes story that I think is interesting because we've seen this now quite a few times, especially across this past year is, when it comes to EKS, you're now abstracting away the instances underneath there. Well, I think one question I always ask is, “Well, how many clusters are you running?” All of your applications don't currently run on a single instance type. Do you run one cluster with a series of instance types, and start scheduling people to instance types based on their workloads? What about your stateful applications? Have you brought those over yet? I have a hilarious story from a friend of mine who was doing a very large-scale Kubernetes deployment where one hundred percent of applications must go to Kubernetes. That was the initiative. And as it turns out, developers, they don't know what to put in the YAML. They don't know all the things they need to fill out. And so, they would deploy their database cluster with their application, and then it would get rescheduled somewhere else, and they'd be like, “Hey, where's my data?”
Jesse: Oh, no.
Pete: Well, they forgot to make the disks persistent. They forgot to do that setting. But, Jesse, I think, to your point, you had mentioned before all these different services that you can use that is giving you kind of a Kubernetes-like experience.
The one place where they all fall down is the simplicity of a YAML file that you define, and you just ship it off, and your magic happens. That is, I think, the thing that has made Kubernetes win is like, “I just want to do this YAML and make it work,” versus, “How do I deploy a Lambda function in 2021?”
Jesse: Yeah.
Pete: Crickets. I mean, you'll either have like, “I don't even know where to start. Do I need Terraform? Do I need CloudFormation? Can I use the Amazon CLI? Can I click around through the UI? Am I really going to let my 1000 developers do that?” Way too many question marks versus, well, I have a YAML file that I can ship via my CI system. That is pretty compelling.
Jesse: I think it's also worth calling out really briefly when we're talking about workload types versus stateful applications. We've seen both stateless and stateful applications run on Kubernetes or run on containers, and we've seen issues like Pete mentioned with his friend’s story of data gone missing because an EBS volume didn't persist, but I just want to highlight, really quickly, please don't run stateful applications on your Kubernetes infrastructure or on your container infrastructure to begin with. I know that there's a lot of benefits to that on the front end; there's a lot of shiny red bows, but there's so many other things to think about that aren't highlighted up front, that are kind of hidden behind the scenes, which I know we'll get to, but it is rarely the right solution for stateful workloads.
Pete: You know, just a reminder to the folks listening, you can go to lastweekinaws.com/QA and give Jesse your feedback. [laugh].
Jesse: Any time. [laugh]. I can't wait to see all the hot takes on that one.
Pete: Yeah. I’m going to agree with you on that one. I don't know if I would personally trust a stateful workload in Kubernetes versus somewhere else. Again, speaking Amazon-specific. If I'm going to run a SQL cluster, it's going into RDS; that's going to be my abstraction layer.
Elasticsearch, I've had way too many years of experience with Elasticsearch, I would feel like I would have more control running it on my own EC2. And honestly, if I had very little experience with Elasticsearch, I would use the Amazon Elasticsearch Service, right?
Corey: If you're like me, one of your favorite hobbies is screwing up CI/CD. Consider instead looking at CircleCI. Designed for modern software teams, CircleCI’s continuous integration and delivery platform helps developers push code with undeserved confidence. Companies of all shapes and sizes use CircleCI to take their software from bad idea to worse delivery, but to do so quickly, safely, and at scale. Visit circle.ci/screaming to learn why high performing DevOps teams use CircleCI to automate and accelerate their CI/CD pipelines. Alternately, the best advertisement I can think of for CircleCI is to try to string together AWS’s code, build, deploy pipeline suite of services, but trust me: circle.ci/screaming is going to be a heck of a lot less painful, and it's where you're ultimately going to end up anyway. Thanks again to CircleCI for their support of this ridiculous podcast.
Pete: So, I feel like there are rare scenarios where, again, on Amazon specifically, you would want to run a stateful workload. I think the biggest thing that we have seen has been because most people are deploying these Kubernetes clusters across availability zones, they're incurring significantly greater charges in that data transfer. Amazon charges for every type of data transfer there is with very few exceptions. And cross-availability zone data transfer continues to be one of the places our customers spend the most amount of money when it comes to data transfer. Oftentimes, they don't actually know what is taking up most of that service.
So, I imagine this scenario: you're moving from Cassandra—and I remember a hilarious story of an account manager years ago who told me, “I could look at your Amazon bill, and I could tell if you're running Cassandra or not because of that data transfer spend.” And again, I laughed at the time, and now, having seen hundreds of Amazon bills, yeah, I could do the same thing. I could absolutely tell you if you're running Cassandra because it's replicating its writes, it's doing all these writes across every AZ. It's incredibly expensive to run something like that. But imagine how much harder it is to figure out where your data transfer spend is going when you have a variety of workloads in the shared cluster, now. Your Amazon EC2 tags are probably going to be largely irrelevant, so then how are you going to track the spend of your applications on Kubernetes? And again, please go to lastweekinaws.com/QA. Tell us how you're solving this now or even if you're thinking about it because I've yet to see any solutions out there that truly solve this problem of defining, and understanding, and tracking your unit economic spend down to that Kubernetes level. If I could just grab a tag of my Cassandra cluster, I'm going to see all the costs associated with that tag, like data transfer, and EBS, and things like that, but how is that going to happen when it's on a Kubernetes cluster?
Jesse: I feel like so far, we've actually only seen this problem solved by tooling, whether that is a third-party tool or an in-house tool, that is specifically designed to look at your Kubernetes clusters and give you insights about usage: what is over-provisioned, what is under-provisioned, et cetera. There's no easy first-class citizen that I've at least seen so far that really gives us this information, either within Kubernetes or within some of the AWS specific services like EKS and ECS.
Pete: Right, yeah.
I could imagine that if you were an enterprise that is very mature in your cloud cost management—let's say you have a really high percentage of tags across your infrastructure, and you're able to really granularly see spend by product, by team, by service—I was doing a lot of this stuff at many of my previous places, really analyzing that spend down. My biggest fear of Kubernetes is losing that visibility of—I’ve got a Kubernetes cluster of, I don't know, 10 EC2 instances, how do I figure out which applications on that cluster make up what percentage of that spend? Would I break it out by CPU used? What about memory used? Just, how would you allocate that spend across that cluster? Or would you then say, “Oh, well, maybe I'll break out a cluster based on each… product? Engineering team?” Where does that go?
Jesse: Not to mention, this is assuming that all of the infrastructure has the appropriate user-defined cost allocation tags associated with it, whether that is EC2 instances or EBS volumes. Because in a lot of cases, that's the other problem; maybe the cluster is all tagged with the tag cluster-1 because the infrastructure team deployed it and they know to use user-defined cost allocation tags. So, it's all tagged as cluster-1 or microservice-1, but within that, who knows? Maybe the developers are aware enough to deploy tags through their deployment pipeline through automation. Maybe they're not. Maybe they need to attach EBS volumes and those EBS volumes aren't getting user-defined cost allocation tags associated with them. There's a wide variety of gray area there.
Pete: Yeah, exactly. These are questions that any team should be asking themselves before they undertake any new kind of platform deployment is, understand how they're going to get answers to these questions because even if you don't care about the question, right? You're an operator, you want to spin up Kubernetes because you want to get that resume beefed up for that next role, right?
Like, I’m a Kubernetes admin. Like, that's an instant 20 percent pay bump at the next place.
Jesse: Yeah.
Pete: [laugh]. But the reality is, is that at some point, someone at a higher pay grade than you is going to be like, “Why is our bill so high?” And you're going to look at this Kubernetes cluster and go, “I don't know how to answer this.” So, any advanced planning you can do to understand how to allocate that spend across however many applications, the better off you'll be in the future. One of the best things that I've learned recently is the concept of taking a cluster, breaking it into units, almost, where you'll have maybe a certain instance type you're going to use for that entire cluster. If you start breaking that out to the cost per vCPU, cost per memory, and you almost break it into these usable chunks. Then each usable chunk, this cluster of 10, maybe breaks into, like, 20 applications of a very specifically defined size. Now you have a unit that you can apply, even to your applications of different sizes. It's almost like a normalized unit like you might do with an instance reservation. But breaking them down to that world. To be honest, I'm not sure how you would handle that if you had a Kubernetes cluster of different instance types. Unless you did, again, some additional tagging in there. And you can see, the complexity gets pretty high here, especially if you're not planning for it before you deploy it. So, that's probably the best advice that I could think of is, plan in advance about how you're going to figure out who is the largest consumer? Are you going to do it by monitoring maybe the YAML level? Are you going to try to front-load additional tags? Or maybe use a third party, maybe use your monitoring system? I'm not sure.
Jesse: Absolutely.
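[Editor's note: the "cost per vCPU, cost per memory" unit idea Pete describes can be sketched roughly as follows. This is only an illustration under stated assumptions, not a method from the episode: the cluster uses a single instance type, node cost is split between CPU and memory by an arbitrary `cpu_weight`, workloads are charged by what they request rather than what they use, and every name and number below is hypothetical.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """A hypothetical application running on the shared cluster."""
    name: str
    cpu_request: float     # vCPUs requested
    memory_request: float  # GiB of memory requested


def allocate_cluster_cost(node_count, node_hourly_cost, node_vcpus,
                          node_memory_gib, workloads, cpu_weight=0.5):
    """Split a homogeneous cluster's hourly cost across workloads.

    Assumption: cpu_weight of the bill is attributed to CPU and the
    rest to memory. The 50/50 default is arbitrary, not an AWS figure.
    """
    total_cost = node_count * node_hourly_cost
    # Hourly price of one vCPU and of one GiB across the whole cluster.
    cpu_rate = total_cost * cpu_weight / (node_count * node_vcpus)
    memory_rate = total_cost * (1 - cpu_weight) / (node_count * node_memory_gib)

    shares = {
        w.name: w.cpu_request * cpu_rate + w.memory_request * memory_rate
        for w in workloads
    }
    # Whatever nobody requested is idle capacity that is still paid for.
    shares["unallocated"] = total_cost - sum(shares.values())
    return shares


# Made-up example: 10 nodes at $1.00/hour, 8 vCPUs and 16 GiB each.
example = allocate_cluster_cost(
    node_count=10, node_hourly_cost=1.0, node_vcpus=8, node_memory_gib=16,
    workloads=[Workload("api", 4, 8), Workload("batch", 16, 16)],
)
for name, cost in example.items():
    print(f"{name}: ${cost:.2f}/hour")
```

[Reporting the "unallocated" share separately keeps idle capacity visible instead of silently spreading it over the applications. A cluster with mixed instance types, which Pete flags as the harder case, would need per-node-group rates and is not handled here.]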
Pete: These are all helpful things, right, Jesse, because eventually someone in finance is going to come over to you and say, “What is this?”
Jesse: Absolutely, I think it's worth noting that Kubernetes not only has the overhead cost of managing the infrastructure—if you choose to manage an open-source version on EC2 instances on-demand versus something like EKS—but there's also overhead in terms of managing the optimization of workloads on it, and managing the attribution of costs for workloads on it. So, be mindful that it's not just expensive in terms of the amount of money you're actually spending on AWS, but the amount of time you're spending on this infrastructure as well, from an engineering perspective.
Pete: Well, I can't think of anything that is more insightful to say than that, which usually means we've reached the end of this podcast. So, if you have enjoyed this podcast, please go to lastweekinaws.com/review, and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice and tell us why you love Kubernetes so much. Don't forget, we would love to hear your questions about anything related to cloud cost management, how you do cost allocation on something like Kubernetes. Go to lastweekinaws.com/QA, shoot us a question and we will pull those together and we will answer them on the air. Thank you very much.
Announcer: This has been a HumblePod production. Stay humble.