Kubernetes prevents downtime ... until it doesn't
ACS, AKS, and our first significant downtime
Differences between Azure Container Service (ACS) and Azure Kubernetes Service (AKS), and the first significant downtime we've had at Medit with Kubernetes
Hello and welcome to the podcast. I’m your host, Dave Albert. In the show, I talk about technology and building a company as a CTO and co-founder, and have guests to discuss their roles in technology and entrepreneurship.
Hey folks, in this episode I’m going to talk to you about Microsoft Azure, Kubernetes, ACS, AKS and the first major production outage that we had at Medit. So I guess I’ll start at the beginning. Approximately a year ago, maybe 14 or 15 months, I decided that because of our BizSpark Plus credits with Azure it made a lot of sense to move all of our infrastructure into Azure, and with that, Kubernetes seemed like a really good way to go. Docker containers are easy enough to use portably, but there are some production considerations with Docker, and Kubernetes actually helps sort out a number of those. Kubernetes does scheduling for you. I don’t necessarily mean time-based scheduling, but that you require a certain number of containers, or in Kubernetes’ case pods, to be running at any given time. So for our API we have three pods right now; for our Let’s Encrypt certificate daemon, that’s just one; for NGINX we have two, I think, maybe three. And those will continue to scale up as we need. So these are the actual applications running underneath Kubernetes. Well, I guess we’re actually running on top of the engine of Kubernetes. So this is how our application is delivered to the world.
If one of the applications crashes, Kubernetes manages that and automatically starts a new running pod, which basically means we don’t have to worry about constant monitoring. I mean, we do monitor, but we don’t need to constantly be worried: is it up? Is it up? Because if it’s not, then Kubernetes will fix that.
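The "keep a certain number of pods running" behavior described above can be pictured as a small reconcile loop. The following is only a toy sketch in Python, not Kubernetes' actual controller code; the `Workload` class and its pod naming are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Workload:
    """Toy stand-in for a workload: a name plus a desired pod count."""
    name: str
    desired: int
    running: list = field(default_factory=list)
    _ids: count = field(default_factory=count, repr=False)

    def crash(self, pod: str) -> None:
        """Simulate a pod dying unexpectedly."""
        self.running.remove(pod)

    def reconcile(self) -> list:
        """Start replacement pods until the running count matches desired."""
        started = []
        while len(self.running) < self.desired:
            pod = f"{self.name}-{next(self._ids)}"
            self.running.append(pod)
            started.append(pod)
        return started
```

For example, `Workload("api", desired=3)` followed by `reconcile()` starts three pods; after `crash("api-1")`, the next `reconcile()` starts exactly one replacement. The point of the sketch is that nobody has to ask "is it up?": the loop compares desired state with actual state and closes the gap.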
Obviously, if there’s a core problem within the code, then you have to sort that out. But in general, if there’s an unexpected exception or crash, then Kubernetes just manages to restart your application. So Kubernetes is a great way to go. We wanted to go with AKS, which is Azure Kubernetes Service. I think it used to be called Azure managed container service, but it’s Azure Kubernetes Service now. 15 months ago I tried to go with that. The problem was that at that point it was in preview as opposed to generally available status, so that meant that you couldn’t really use it in production. I’m not sure exactly when it moved to general availability; I think that may have been in the last four or five months. So we went with ACS, which is Azure Container Service, which allows you to use either Kubernetes, DC/OS or Swarm, maybe; I can’t remember exactly which of the options were available. But now ACS is being deprecated. We’re currently on ACS because, as I said, AKS was in preview mode and that’s not any way to run a production system. In fact, there was about a week of downtime with AKS back then, 14 or 15 months ago, which is when I moved to ACS instead of AKS.
So fast forward to last week. Sorry about the microphone noise there; I think I’ve got it settled now. Actually, before last week, back in December, our container registry went down. That’s where we store our Docker container images, which is the Azure Container Registry. ACR, I believe that’s correct. There are so many different cloud services between Azure, AWS and Google Cloud that it’s hard to keep them straight. So the registry went down in December and I had to open a ticket to get it figured out. Turns out it was a subscription expiration date or something like that. It was basically just a one-liner with the Azure command line client that sorted it out, and I went ahead and set it for like 20 years or something like that, so that it’s never going to be a problem again.
Now I need to go back when we get a little bit of time and make sure that everything is set the way we would want it to be, but there’s a very limited number of people who have access to the Container Registry and there’s really nothing to change there. It just means that ACS can contact the registry, that’s all. So we weren’t able to pull new containers, but that’s fixed.
Now, last week. We have our MongoDB running in Kubernetes. It’s not optimal, but it uses persistent volumes in a StatefulSet. To this point we haven’t had any major problems. Again, if the database runs out of memory or has other issues, then Kubernetes will schedule a new pod and restart it, and it’s back up in seconds usually, minutes maximum if it’s pretty far out of sync. The biggest problem we’ve had is with readiness. I’m not 100% sure how the StatefulSet readiness check is done, that is, how Kubernetes decides whether a pod is ready. Our StatefulSet says we need three ready pods. The StatefulSet has some sort of priority ordering in it. It came from a Helm chart: we took the StatefulSet MongoDB Helm chart that you can find on GitHub and configured it for ourselves; it’s the standard one. So pod zero has to come up before one will come up, and one has to come up before two will come up. Sometimes a pod gets into a weird state where it may be serving traffic and working perfectly fine, but it says it’s not in a ready state. And what can happen then is that if, say, zero is not in a ready state and two goes down, two won’t come back up because it’s waiting for one or zero, whichever. If a lower-numbered pod is not showing ready and then a higher-numbered pod goes down, it won’t come back. So typically you just delete the pod that’s not in a good state, making sure to monitor that everything appears to be working properly as it is restarted and ensuring that there’s no spike in traffic going on at that time, and that’s been fine.
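The ordering rule described above (a pod only comes up once every lower-numbered pod reports ready) can be modeled in a few lines. This is a simplified sketch of that rule as the episode describes it, not the real StatefulSet controller; the function name `pods_allowed_to_start` and the `ready` mapping are hypothetical.

```python
from __future__ import annotations


def pods_allowed_to_start(ready: dict[int, bool], replicas: int) -> list[int]:
    """Return the indices of not-ready pods that may be (re)started.

    Under ordered startup, a pod may start only if every pod with a
    lower index is currently reporting ready.
    """
    allowed = []
    for index in range(replicas):
        if ready.get(index, False):
            continue  # already ready, nothing to do
        if all(ready.get(lower, False) for lower in range(index)):
            allowed.append(index)
    return allowed
```

With `ready = {0: False, 1: True, 2: False}` (pod zero serving traffic but not reporting ready, pod two down), the function returns `[0]`: pod two is blocked until pod zero is sorted out, which is the stuck state described above. Once pod zero reports ready, the same call returns `[2]`.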
Then last week we noticed that pod number three was in a bad state. We deleted it as usual, and it didn’t come up, but we saw that zero was also in a bad state. Zero was the master, which is why we didn’t delete it first. And we got into that situation where three wouldn’t come up. I say three, but it was actually two: the pods are zero, one and two, still three pods, but you know, index zero versus index one. Off-by-one errors, you have to love them. So we deleted two, and zero was in a bad state, so we had to delete zero as well, even though it was the master. I was monitoring one, which was the middle pod, and it never got promoted to primary. So I ran an rs.reconfig, setting it to the highest priority in the replica set, so that it should have immediately become the primary. It didn’t. We ran a number of other tests and tried to promote it, and it wouldn’t come up, so then I opened a ticket with Azure because it didn’t seem like it was actually a Mongo problem. I think at some point, when we realized we were actually down already, we tried deleting pod number one too. So we had pods, but we had zero pods that were serving as a primary. So there were no writes, and so we weren’t actually working; we couldn’t do anything.
Through further investigation we found that a certificate had expired. Now, it was difficult to troubleshoot which certificate that was. In Kubernetes with ACS there are a few different config files in /etc/kubernetes, I think there was another one in /var/run/kubernetes, and maybe one other place. I can’t think of where it was exactly, but it doesn’t really matter. There was also the certificate for Mongo, which is what I thought the issue was. Then eventually we found the /var/run/kubernetes client certificate and ran openssl on it, that’s actually “openssl x509 -in <certificate name> -noout -dates”, and that gives you the not-before and not-after dates between which the certificate is valid. And that was the actual certificate that had expired. It was the TLS certificate for the Kubernetes API.
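The `-dates` output mentioned above is two lines, `notBefore=...` and `notAfter=...`, so it is easy to check from a script. The sketch below assumes that output format and uses Python's standard `ssl.cert_time_to_seconds` to parse it; the helper names and the sample dates are made up for illustration and are not the dates from the actual outage.

```python
from __future__ import annotations

import ssl
from datetime import datetime, timezone

# Hypothetical sample of what `openssl x509 -noout -dates` prints.
SAMPLE_OUTPUT = (
    "notBefore=Jan  5 09:34:43 2018 GMT\n"
    "notAfter=Jan  5 09:34:43 2019 GMT\n"
)


def parse_openssl_dates(output: str) -> dict[str, datetime]:
    """Parse notBefore/notAfter lines into timezone-aware UTC datetimes."""
    dates = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key in ("notBefore", "notAfter"):
            seconds = ssl.cert_time_to_seconds(value)
            dates[key] = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dates


def is_expired(output: str, now: datetime) -> bool:
    """True if `now` is past the certificate's notAfter date."""
    return now > parse_openssl_dates(output)["notAfter"]
```

Run against every certificate on the master on a schedule, a check like this would flag an upcoming expiry well before it turns into downtime.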
So there are a lot of different certificates floating around: the ones for communication between the master and the agents, the one in front of the application, you know, the TLS certificate from Let’s Encrypt, the certificate that’s created for Mongo. It made no sense to us where that certificate even came from. None of the configs mentioned it. And I still had to wait for Azure support to help identify how the certificates were created, because I didn’t know what ACS’s automation entailed. And you know, once you manually touch something within an automation system, that system is probably broken. Not always, but often. So I didn’t want to go and make changes until I understood what our actual problem was. It turns out that Kubernetes on creation, on first run, if the /var/run/kubernetes client certificate doesn’t exist, and you don’t pass the parameters, and it’s not in the config that you’ve given to create your Kubernetes cluster, will automatically create a self-signed certificate and use that. So we finally found the certificate and why it was created there, and were told that basically it was a Kubernetes issue and not an Azure issue because it was ACS. Had it been AKS, it would have been Microsoft’s issue to deal with.
So ACS basically is just a set of scripts that are executed to create a master node and then agent nodes. The master is the one that tells the agents what they’re supposed to do, and the agents are where your actual Kubernetes containers run. With AKS you just get agent nodes, and the master node part is handled invisibly to the user. We also found that we were on Kubernetes 1.7. We’re in the process of trying to upgrade to a much newer version through AKS, but I’m going to sort out some database elements first, before we worry about that, to make the move easier. Sorry, I forgot what I was saying. But, so, old version: I think we’re on 1.7, and in 1.8 there’s a command to recreate the certificate. The support at Azure gave me a GitHub issue that had some documentation on how someone went through this long process to recreate the certificate. I read through it and saw that one of the steps was that you had to actually reboot the master node host. And we had hypothesized that, since the self-signed certificate was created when Kubernetes spun up because it didn’t exist, if we deleted it, or actually moved the old certificate, and then restarted the correct Kubernetes process, that would sort out the problem. The only advice support was giving us, which they wouldn’t stand behind, was: this appears to be what you’re looking for; if you are interested in trying it, then you might have luck with it. Beyond that, because it’s a Kubernetes issue and not an Azure issue, we were kind of on our own to sort that out. So we moved the certificate and restarted the host, and everything started to come back, and then we were live again. Most of the time was spent waiting and figuring it out. It took some time, but yeah, it’s all back, and now we know what to do before our test system expires.
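The "move it, don't delete it" step above can be scripted so the old certificate is always kept under a timestamped name. This is a minimal sketch of just the file move, assuming you already know which certificate file expired; the function name and the `.expired-` suffix are hypothetical, and restarting the Kubernetes process or the host afterwards is deliberately left out.

```python
import shutil
import time
from pathlib import Path


def move_aside(cert_path: Path) -> Path:
    """Rename the certificate to a timestamped backup beside it.

    Returns the backup path so the old file can be inspected or
    restored later instead of being lost.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = cert_path.with_name(f"{cert_path.name}.expired-{stamp}")
    shutil.move(str(cert_path), str(backup))
    return backup
```

Keeping the old file around costs nothing, and if the regenerated certificate had not fixed things, the original would still have been there to read the subject and other fields from.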
So our production system on ACS was created before the test system on ACS, because it was pre-launch and test was still on AKS, or actually, I think test might have still been standalone VMs. So that was our downtime, and those were some of the issues with ACS. Still, ACS has been good to us. It has had some issues, and one of the biggest is that we’ve wanted to upgrade Kubernetes for a while but there isn’t really a good path for that. Now, with AKS, you basically just make a change in the portal and wait while it does rolling updates, which is fantastic, and that’s why I wanted to use AKS from the beginning. But unfortunately, I was just some number of months too early.
So yeah, I’m sure there are people out there asking: what about your other data center? Yes, indeed, what about the other data center? Well, we’ve been in this process of migrating to AKS and figuring out our database needs. The deal is that there’s not a great way to do VNet peering, which is basically how you get a VPN from one data center to another in Azure. There’s no good way to make that work between different AKS clusters. So say you have a Kubernetes cluster in West Europe and a Kubernetes cluster in North Europe, say I’ve got three of those and I’ve got a Mongo in each one. For them to communicate, obviously you would want to do that over a VPN or using VNet peering. But because of the way VNet peering is handled, and the gateways and the virtual... I can’t remember what it was called, and I don’t have it in front of me. Sorry about that. If you want more information on that particular bit, hit me up on Twitter and I’ll let you know. The way that the networking is built, you can’t route traffic from one Kubernetes pod through all the different layers into a service in another data center using VNet peering. You could do it if it was a publicly available service, but you wouldn’t want your database to be a publicly available service. That’s the point of a VPN, or VNet peering in this case. So I’ve been trying to figure out the right way to do this. I don’t want to maintain hosts if I don’t have to. Even though the data is persistent, the servers don’t actually need to be persistent for the database. So instead of having to handle patching and all the other elements that are involved with maintaining a VM, I’m happier to let Kubernetes do it. I’d be much happier to let Cosmos DB do it, which I’ve had an episode about in the past.
And I think now we may have a path forward with that. Over the break for the end-of-year holidays, I was working on another project tangential to Medit, and I’m pretty sure I have a good way for us to use Cosmos in a way that is not prohibitively expensive. Probably in the next week or so we’ll begin our first tests and start migrating some of the collections into Cosmos, slowly, and specifically the ones that are less critical. Because once it’s in Cosmos, then all of our, not data, all our servers within Kubernetes are immutable and there is no persistent data there. All the persistent data is in Cosmos, which is automatically geo load balanced. So that’s probably the path to get us to a second data center. It’s been on my priority list for a while, but there have just been so many different things that have come up that have prevented it from happening. I think we’re nearly there. And just another reminder: this downtime wasn’t as long as it could have been. If just deleting that certificate hadn’t fixed it, I’m not sure what we would have needed to do, because I don’t know if we would have needed to know the exact CSR, the certificate request. CSR, is that right? It doesn’t matter. So many acronyms in this industry. So, the certificate request: if we had needed to know all the different elements exactly as they were in the existing certificate, I suppose you could pull that out with openssl, getting the text instead of just the dates. But I’ve never had to recreate a certificate request from a certificate, and trying to do something like that for the first time when you’re in the middle of a downtime never seems like a great thing to do. So it could have been way worse. And even that may not have solved the problem.
But luckily, deleting that certificate and restarting did. And like I said, having this downtime now reminds me why we so desperately need to have that second data center, and the data external to a single data center.
So yeah, ACS gives you the master and agents; AKS gives you just the agents. The upgrade path with ACS seems to be non-existent, and ACS is also being deprecated. The AKS upgrade path is simple: it does rolling upgrades, which basically destroy the old host after the new host is available, so that does your patching as well. So if you keep up to date with AKS, then you really don’t have to worry too much about any additional updates. Of course, this means you would need to be even more diligent with testing the upgrade in your testing environment. So that’s our plan: sort out Cosmos and make Mongo unnecessary. If Cosmos isn’t the solution, potentially MongoDB Atlas is. Otherwise I’ll have to run a Mongo replica set across three different data centers and maintain those VMs. I mean, I can do it, you know, I’ve done things like that before, and with Ansible it’s not the worst. But the more infrastructure we have to manage, the less time I can spend building, and that’s where we get the most value for growth at the moment, through building additional elements. Obviously excessive downtime would destroy any growth, but this is the first downtime we’ve had. We released in March of 2018 and it’s January of 2019, so in 10 months this is the first time we’ve had more than a minute or two of downtime. So overall Kubernetes has made life a million times better; it makes deploying a million times better. There are always little things that you just can’t know about until you know about them. So I hope, if you’re using Kubernetes, you learn about the certificate and it helps you avoid any downtime. Just deleting that certificate, the client certificate, and restarting sorted it all out. Well, actually moving it; I try not to delete things unless I absolutely know for sure that they’re never needed again.
Yeah, I think that’s pretty much everything I wanted to discuss on Kubernetes, ACS and AKS. If you use any of those I’d love to hear more from you: what your experience is, any problems you’ve had and how you overcame them. As always, you can email me at [email protected] or find me on Twitter @dave_albert. Thanks for listening.
Until next time, remember: any sufficiently advanced technology is indistinguishable from magic.