Guest James Wynne

James and I talk about Ops Fire drills and exercises

Fri, 12 Apr 2019 04:20:54 GMT

After hearing a presentation about ops fire drills given by James, which was something already on my mind, I wanted to talk with James more about the topic.

https://www.linkedin.com/in/jawynne/
The Dungeons and Dragons talk is https://www.youtube.com/watch?v=WmizLzszopw

Transcript:

Dave: Hello, and welcome to the podcast. I’m your host, Dave Albert. In this show, I talk about technology and building a company as a CTO and co-founder, and I have guests on to discuss their roles in technology and entrepreneurship.

Today I’m joined by James Wynne, an anchor of the CloudOps team at Pivotal here in Dublin. Thanks for joining us, James.

James: Thank you.

Dave: Can you tell us a little bit about your role, and yourself?

James: Yes. So, my name is James Wynne. I’ve been working in infrastructure, infrastructure professional services and sysadmin-type roles for approximately the last 20 years or so. Currently, at Pivotal, I’m part of the operations team that runs what we call a foundation for a major internal customer of Pivotal. So we’re responsible for the normal sort of operations things: upgrades, migrations, feeding and watering the software and such.

Dave: Okay, cool. So, I met James at a meetup recently where he spoke about ops fire drills and exercises to help keep your operations up to speed and in shape for any real outage events. That was already a topic I’d been considering. And since the event, I actually finally sat down and added a few routines to our Slack bot, so we can now add exercises and drills. So thanks for that. So what gave you the idea to create that?

James: So there’s a couple of things, I suppose. As I said, for the last 15 to 18 years I’ve been on call in some capacity or another, and I’ve worked with teams that have been on call. Over those years I’ve seen people burn out, I’ve seen people being ill-prepared for going on call, and so on. About a year, a year and a half ago, we had a major initiative which required our team to start going on call for the first time. So based on my experience at other companies, what I wanted to do was create a way to onboard engineers into going on call. The way I did this was to come up with two concepts. It’s kind of related to chaos engineering and stuff like this, but first of all it’s the concept of drills and exercises. Someone actually said recently that it would probably be better if I called them drills and fires, but that’s beside the point. The concept is that drills are to build up an engineer’s confidence in a certain amount of technical competency, whereas exercises, or fires, are to practice deeper knowledge, troubleshooting abilities and so on.

For drills, it’s very much a confidence-building exercise. We create a drill book with a set of maybe 15 or so very simple tasks that should touch your infrastructure across the board. They might be something like: log on to the database and get the logs. It might be just a simple thing: how much memory is your web server using, and so on. The point is that you try to touch every part of your product base, and we then start paging our engineers to see how they respond. The key thing is that the first time they do it, they do it in a very comfortable setting: sitting at their own desk, with their own team members around to help. After that, they do it individually, but still at their own desk. Then we start using their on-call hardware, and then their on-call hardware in a variety of circumstances. So, for instance, are you able to perform these on-the-face-of-it simple tasks under difficult conditions? A difficult condition might be from Starbucks, or using someone else’s hardware, and so on. So that’s the whole concept of drills: to build up competence in an engineer to do these sorts of things.
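A minimal Python sketch of the drill book James describes might look like the following. The `Drill` class, the drill names, the `SETTINGS` list and the `next_page` helper are all hypothetical names made up for illustration; only the two example tasks and the progression of settings come from the conversation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Drill:
    name: str          # hypothetical identifier for the drill
    instruction: str   # the prescriptive task sent in the page


# Two example tasks from the conversation; a real book would hold 15 or so.
DRILL_BOOK = [
    Drill("db-logs", "Log on to the database and get the logs."),
    Drill("web-memory", "Report how much memory the web server is using."),
]

# Settings get harder as the engineer completes more rounds.
SETTINGS = [
    "at your desk, with teammates around to help",
    "at your desk, on your own",
    "on your on-call hardware",
    "on your on-call hardware, away from your usual setup",
]


def next_page(completed_rounds: int, rng: random.Random) -> str:
    """Pick a random drill and the setting that matches the engineer's progress."""
    setting = SETTINGS[min(completed_rounds, len(SETTINGS) - 1)]
    drill = rng.choice(DRILL_BOOK)
    return f"[DRILL] {drill.instruction} Setting: {setting}."


if __name__ == "__main__":
    print(next_page(completed_rounds=0, rng=random.Random()))
```

The task itself stays simple throughout; only the setting changes, which matches the idea that drills build confidence rather than test troubleshooting skill.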

Dave: Yeah. One of the examples you gave, about not having access to the VPN or whatever, really made a lot of sense, because, you know, we’ve got different rules where you can’t connect to our infrastructure through different means. And so if I were out, what would I do?

James: Yeah. And sometimes the reaction is, “Hey, I know how to log on to the database. Why are you doing this practice page to log on to the database? That’s not going to teach me anything new.” An example for me was when I had a real page and discovered that my broadband provider gave me an IPv6 address, which I wasn’t able to whitelist on our [inaudible 06:17]. So that turned out to be a good drill for me, and I had to figure out new ways to access the system. We also had a situation where someone’s VPN client was out of date. You only actually see these things when you do them for real. And once you’ve done them for real, or once you’ve been forced to think about them under the simple scenarios, then you can prepare mentally for how to do it for real. One piece of feedback we’ve got when we’ve done these sorts of drills with other teams is that everyone intrinsically knows how to do their job, but when they’re faced with an unknown, they very often panic. By practicing the simple thing of getting access to a system, their mind is taken away from the problem and they go into this kind of muscle-memory mode, where they don’t need to think about the problem for now. They just need to do the thing they have built up muscle memory for, and they’re able to get to the problem really quickly. We like to relate this, when we’re talking to people, to mean time to recovery. You have an SLA with your customers that will often say you need to have a mean time to recovery of one hour. But if it takes an engineer 25 or 30 minutes to get access to where the problem is, your minimum mean time to recovery is immediately 30 minutes.
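The arithmetic behind that last point can be sketched in a few lines of Python. The function name `remaining_fix_budget` is hypothetical; the one-hour target and the 30-minute access time are the figures James uses.

```python
from datetime import timedelta


def remaining_fix_budget(mttr_target: timedelta, access_time: timedelta) -> timedelta:
    """Time left to diagnose and fix once the engineer has reached the system."""
    return max(mttr_target - access_time, timedelta(0))


if __name__ == "__main__":
    target = timedelta(hours=1)      # mean time to recovery promised in the SLA
    access = timedelta(minutes=30)   # time spent just getting to the problem
    print(f"Left for troubleshooting: {remaining_fix_budget(target, access)}")
```

Every minute shaved off access time through drills goes straight back into the troubleshooting budget.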

Dave: That also removes all that cognitive overhead that we all have: okay, now I need to figure out the problem, but, like you said, I also need to get to the problem. So it’s not even just time. You can actually have adrenaline rushing, because you’re like, “Oh my God, I can’t remember how I’m supposed to connect to this thing, because I haven’t connected in six weeks, six months, six years.” Who knows?

James: Yeah, absolutely. It removes a lot of the pressure. One of the other things that we do when we’re doing these practice fire drills is we generally try to use the actual tools an engineer would use, so we page people on their phones. And it’s kind of interesting, because as the engineers get comfortable and progress through them, we start doing other things, like giving them at random times and so on. We vary the times a little bit. An engineer will often get a shock when they suddenly see a strange phone number come up on their phone, and they’re going, “What in God’s name is this?” But by them becoming used to the fact that, okay, this is something I know how to handle, when the call comes in for real you don’t have that deer-in-the-headlights kind of effect.

Dave: Yeah, I know. So when I first thought of the fire drills, my impression of what we wanted to do was shaped by the fact that we’re not ready for chaos engineering yet. We’re too small. We don’t have enough people. We’re close, because we use Kubernetes, so that at least helps. Our pods restart themselves, so I can kill a pod, no problem. But the thought was: let’s come up with scenarios and then work through all the possibilities of how we would go about solving the problem, so that we had an outline ahead of time and didn’t have to start from zero. So what I was thinking of is actually what you’re calling exercises. And your version of drills takes that even a step further and helps avoid having, “Wait, I need to install this. Oh, but your client’s out of date, and so is kubectl.”

James: Yeah, and chaos engineering is kind of the new buzzword, kind of the cool thing to do, but I very often think that, the way it’s been talked about, it’s designed for the cool kids. You have to start somewhere. Trying to practice what it looks like if you lose an entire availability zone is a cool exercise, but you’re not really helping someone build up confidence by setting their house on fire for the first time. The way to build up confidence is to let them do their day job in different conditions. One of the other interesting things is, yeah, this is really cool for a startup, to start bringing people on board and on call, and for junior engineers, getting people used to this brand new system. But we also found that in larger organizations you very often have services that aren’t, let’s say, part of the core product, or they’re a legacy product or whatever. The institutional memory of how to do these things gets lost, or else gets siloed. Including access to these systems in your drills is a great way to keep that memory fresh and also to spread it out of the silo, in a very safe manner that people don’t feel stressed about. I tend to tell a story about a particular platform I used to look after. It was very legacy, and we almost never got calls about it. It was running more or less solidly for seven years or something ridiculous. I was the only person in the organization who knew how to log in, which immediately is not a good situation. But there was a particular scenario where we did have to log in, and the batteries in the RSA keys had gone, simply because nobody had gone near the platform in two years. So it’s another example whereby a simple drill like that just crops up these issues before you actually have to do it for real.

Dave: Yeah, I mean, some of these things you just can’t plan for without doing something like these drills. So I’m really glad that I’ve actually started to create the ability within our Slack bot to do it. I can actually call the bot’s function to give me a drill.

James: Yep. I think we’re working on a Concourse pipeline to do a similar sort of thing. So, yeah, the drills are important for building up confidence and muscle memory. The next thing is maybe to talk about exercises, which is the next level up. For exercises, what we try to do is make them more of a team-based thing, rather than people working on their own. What we try to do there is we kind of set the house on fire this time, but we try to come up with scenarios that are technically complex and may require you to interact with other teams around you. So you can imagine you’re the operations team and it’s a complex application you’re running, but you may need support from your database team, you may need support from your firewall or networking team, or whoever it is. We try to organize scenarios whereby we break something in a complex manner. We then page someone, and they have to follow through and troubleshoot. Hopefully, with the muscle memory they’ve already built up from the drills, they’ll be able to react pretty quickly and get to where they need to troubleshoot pretty quickly. One thing I will say, though, just going back: when someone is paged with a drill, they are paged in a very prescriptive way. This is what you need to do: log on to the database, get the last line of the logs. That’s it. In the exercise, or fire, manner, it’s actually very different. They’re given almost minimum information. They’re given what a customer might see: “Your server is giving 500 errors.” That’s it. So they immediately have to figure out which server, they might have to figure out where it’s hosted, how do I log in, and so on. But, again, because they’ve built up the muscle memory, that bit should become very easy and very quick, and they should be able to start troubleshooting. So drills are not there to trick someone. Exercises, yeah, you can kind of try to screw around with people in exercises.

Dave: Yeah. But it also should be legitimate. It shouldn’t be the most complex, bizarre bug you’ve ever seen, though. One of the things that I also plan to do is, with each outage, add it in and document what it was in Confluence, so that..

James: Yeah, that’s an excellent point. Your exercise shouldn’t be based around a problem that wouldn’t occur in real life. They should be designed to be as realistic as possible. Again, I’ll give two examples of it. A good exercise to try is to simulate clock drift on a couple of your servers, because that generally starts breaking things like SSL and has all sorts of weird symptoms, and SSL can be surprisingly difficult to troubleshoot. We’ve run these exercises several times with people. And it’s actually quite interesting: when we do drills and exercises, we encourage people to continuously update, let’s say, PagerDuty or whatever it is you’re tracking your incident with. I was running one of these exercises, and the engineer kept on posting log entries. Each time they did, the log entry naturally had a timestamp in it, and the post had a timestamp on it, and they basically kept on posting these, saying, no, no, this isn’t our problem. That just kind of went on until I eventually sent them back a screenshot of both timestamps. So that was an example of a good simulation. An example of a bad simulation: we had someone who was trying to simulate network congestion and dropped packets on the network. They did this using a set of iptables rules, which seems like a reasonable thing to do. The problem is that the engineers troubleshooting this were counting packets using ping, and the iptables rules didn’t account for ICMP. So they were seeing zero packet loss, and that made it very difficult to troubleshoot. So when creating scenarios, be as realistic as possible. And one of the things you alluded to just there: keep logs of your real incidents, and build up a kind of manual or set of these things. Talk to other teams. In our case, we often look at support tickets from real customers as they come along, and we go, would one of our other teams be able to solve this? With that, you’re able to build this big, broad spectrum of exercises for people to experience.
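The clue in the clock-drift story is the gap between the timestamp inside a fresh log entry and the time it was posted to the incident tracker. A minimal sketch of that comparison in Python follows. The function names and the two-minute tolerance are assumptions for illustration, and it assumes the log line was written moments before it was posted.

```python
from datetime import datetime, timedelta, timezone


def clock_skew(log_time: datetime, posted_at: datetime) -> timedelta:
    """Absolute gap between a log entry's own timestamp and when it was posted."""
    return abs(posted_at - log_time)


def suspect_clock_drift(
    log_time: datetime,
    posted_at: datetime,
    tolerance: timedelta = timedelta(minutes=2),  # assumed threshold
) -> bool:
    """True when a freshly written log line disagrees with the tracker's clock."""
    return clock_skew(log_time, posted_at) > tolerance


if __name__ == "__main__":
    posted = datetime(2019, 4, 1, 10, 0, tzinfo=timezone.utc)
    fresh_log = datetime(2019, 4, 1, 9, 43, tzinfo=timezone.utc)
    print(suspect_clock_drift(fresh_log, posted))
```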

Dave: Yeah. Let’s take some of the ones that I’ve already put in our system. We shipped to test and it worked, and everything’s fine. And now we’ve shipped to prod and the pods just keep restarting. So it doesn’t have to be complex, but it’s things that actually crop up. And knowing beforehand means that when a new person comes in, they don’t have to have experienced that to say, “Well, let me check the ConfigMaps to make sure that the ConfigMaps match, and to make sure the secrets match.”

James: So, yeah, it is good for learning, for learning your environment. There’s a colleague of mine who does this great talk about what we call Dungeons and Dragons exercises. What we would do is take a scenario, and we’d have a room full of engineers. You would have two people being kind of the dungeon master, and the first one would say, “Okay, our main application server is giving 500 errors. What do you do?” And then someone from the group will say, “Oh, log on to your NGINX server, get your log files,” or whatever it is. The dungeon master will then conceptually go, “Okay, you’ve logged on. What do you do now?” “Oh, check for errors.” “There’s no errors. What do you do now?” “Oh, check for memory utilization, check for disk,” whatever it is. As you give them clues, they start building a mental model of what the problem is. And as a group exercise, people can then start discussing: “Hey, why did you check for disk space but didn’t check for insufficient inodes?” I’m trying to think of more examples, but it allows people to start discussing this, and then people go, “That’s actually a good troubleshooting technique. I wouldn’t have thought of running an strace. I wouldn’t have thought of looking for DNS responses on the network using tcpdump,” or whatever. It becomes a discussion, and people build up a more holistic way of troubleshooting. My colleague gave a great talk on this just recently. The way she explains it, she starts off with this really simple Dungeons and Dragons scenario, this Q&A. The audience solves it pretty quickly. But then she goes, “Oh, damn, the next day this happened.” So the audience has now learned a little bit about her infrastructure from the first scenario.

They can use that knowledge and leverage it to go on to the day-two problem, and then she goes, “And then the next day this happened.” And again, from answering the second exercise you know a little bit more, so it helps you to answer the third exercise, and so on. But while she’s doing this, she has a whiteboard, and she’s actually drawing out her infrastructure. I’m sure many people have joined a new company and, at some point, someone’s brought you into a room and brought up some massive Visio or PowerPoint or whatever the hell it is, with a million boxes on it, and they’re telling you that this connects to that, but only if this connects to that, and you can authenticate against the other thing. It just goes in one ear and out the other. But her framing of solving these small problems progressively allows the audience to start building a mental map of her infrastructure in a way that’s really compelling and memorable. So these Dungeons and Dragons exercises are a really good learning tool in situations like that.

Dave: That’s really neat. Yeah, I mean, that’s kind of an interesting idea, that you could possibly use maybe the first couple of steps for interviewing..

James: Possibly, yup.

Dave: You know, actually taking those add-on bits, as opposed to just dumping some random question: taking questions that build on each other, based on your real infrastructure, so that you bring them along, as opposed to just dumping that on someone. You’ll learn more about the way they learn and think.

James: Yeah, absolutely. And my colleague’s version of Dungeons and Dragons is really interesting. I think she even explains the whole idea as: you’re in a dark dungeon, and as you learn more, more of this labyrinth gets illuminated, and the more illuminated it gets, the more you can illuminate. In my mind it was a really interesting way to learn about her, you know, make-believe infrastructure. So it’s a really good tool, and, by the way, it’s a good tool for any new engineer or for an experienced engineer. One of the ones we did was actually when a brand new team was being formed, and it was a good bonding session for the team as well, because they got to know how their new workmates thought, what sort of skill sets they had, and so on. So it’s really interesting. I might be able to provide you a link to that talk. It’s pretty good.

Dave: Please do. Yeah, sounds good. That’s exposing things with light and finding your way in the darkness. That really is what we do.

James: Yep. Absolutely.

Dave: And I think that may be why so many of us, especially those of us who have been doing this for quite some time, found interest in the old text-based games or the multi-user dungeons, the MUDs, back in the day. That’s basically what we’re doing. For those of you who are a bit too young to know what a MUD or multi-user dungeon is, it was text-based World of Warcraft, basically.

James: Absolutely, turn left.

Dave: But now we just do that with ping and traceroute.

James: Yep. Yeah, it’s a real fun exercise. You reminded me of something there, and for the life of me I can’t remember. But yeah, it’s a good thing, and I find this is a really good exercise to put your team through. We’ve had two examples of exercises now: one is to break stuff in a wacky way and see how your team responds to it, and then we had the conceptual Dungeons and Dragons exercise. But one of the first exercises we mention to people, and remember, exercises should be more of a group thing, is a real kind of no-op exercise. When you suggest it to a bunch of engineers, “Hey, we’re going to do this,” they look at you as if to say, “Do you think we are four-year-olds? This is the silliest imaginable thing.” What you do is you get everyone to sit around in a conference room, or a virtual conference room. The first person pages Billy, Billy pages Mary, Mary pages Annie, and you go around in a circle. I’ve been in teams where we said, “Okay, this is what we’re going to do,” and they’re kind of going, “Oh, I thought you were going to be here to break stuff and show us how cool things are.” No, let’s do this first. What we have found out every single time we’ve done this is that at least one person hasn’t, let’s say, got PagerDuty installed. Another person has the wrong phone number in there because they only just moved to the country six months ago, or whatever. On the face of it this sounds like a silly exercise, but it’s actually probably the single most useful exercise you can do. Nearly every team should try to do this once in a while, especially for people who don’t get called very often. In fact, we’re in a pretty lucky position, and I hope I don’t anger the gods of IT by saying this. We have a situation at the moment that we call on-call atrophy, which means we don’t get called enough.

What happens is people change their phone number, and because they haven’t been getting calls, they don’t receive the call when it comes, which is obviously a really bad thing. So the on-call engineers have agreed that once a week, while they’re on call, between, let’s say, six o’clock and nine o’clock in the evening, they receive one of these simple drills. When I told other people about this program, they said, “Oh my gosh, I’d be so upset.” But people are on call, they see this, they’re able to do it successfully, and they know that if they get a real call any time during the week, they are prepared. So instead of being a negative thing, the feedback from all the engineers has been that this was a really positive thing. It was a reassuring thing for them. Again, it’s building up confidence.
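The page-around-the-circle exercise is simple enough to sketch. The helper below is a hypothetical illustration in Python that just works out who pages whom; actually sending the page would go through whatever paging tool the team really uses, which is not shown here.

```python
from typing import List, Tuple


def paging_circle(team: List[str]) -> List[Tuple[str, str]]:
    """Return (sender, receiver) pairs so that everyone pages the next person
    and the last person pages the first, closing the circle."""
    if len(team) < 2:
        raise ValueError("need at least two people to pass a page around")
    return [(team[i], team[(i + 1) % len(team)]) for i in range(len(team))]


if __name__ == "__main__":
    for sender, receiver in paging_circle(["Billy", "Mary", "Annie"]):
        print(f"{sender} pages {receiver}")
```

Because every person appears once as a receiver, a missing app install or a stale phone number shows up for whoever has it.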

Dave: Well, that’s the big thing. I mean, most of us wouldn’t be in the situations we’re in if we weren’t at least pretty good at what we do, right? So usually, in the past, we’ve been able to figure it out. But that black box staring at you when the pressure’s on, for some of us anyway, can be nearly debilitating.

James: Yep, absolutely. I’ve done this with a lot of teams. With one particular team we did a role play. My suggestion going into it was to do the drill things first, and then we’d come along and facilitate the role play. The team said, “Oh, we don’t have time, and we all know how to do this.” So, okay, let’s do the role play. And the engineers literally looked like deer in the headlights. The feedback from doing the role play was still positive, because they felt they got a better understanding. But the feedback was also that they would have liked to have done the simple things first, and they think they would have been more prepared. So the thing about this is: don’t be cool. Do what’s best. It kind of reminds me of my nephew. He’s a black belt in kickboxing, and his training is repeating the same movement over and over and over again. To an outsider looking in, that seems like a stupid thing, but then when you see him at a competition, he has the muscle memory built up and he’s able to do these moves super quickly and super competently. So that’s the way you should think of these things.

Dave: Yeah, with any sport, on your first day you don’t go and play a match against the hardest opponent you’ll have. You don’t climb into the boxing ring and just start swinging. You learn the fundamentals. And it’s not even just learning them, because you may have been away for years and come back, and nobody’s having an incident of the exact same kind every day, hopefully.

James: Yeah. The snooker at the Crucible is on at the moment. These are the top snooker players in the world, and if you watch them on their break, they’re sitting there potting balls from set pieces over and over and over again. This is something that everyone benefits from.

Dave: Yeah. Have you had anybody stay unhappy with it after doing it?

James: No, because people invariably do get caught out. That sounds a bit strong, but they kind of recognize, “Oh, I wouldn’t have been able to do this for real.” And everyone’s happy that they’ve done it. There’s also a kind of shock factor with, like, pass-the-pager. In the case I was talking about earlier, where someone’s phone number was wrong, we were in a room with eight people. We got to the fourth person and it was like, “Oh my gosh, it’s the wrong phone number.” Immediately the remaining four people took out their phones and were checking them, because they didn’t want to be caught out. So even though they weren’t the person who discovered this situation, all of them benefited from it and suddenly went, “Okay, this was actually a good idea.” Maybe your team is pretty perfect, but these things happen. As I said, the very first time I got a call-out here, I discovered that some of the CLI tools on my laptop were out of date. So I’m sitting there preaching to people to do this, and here it is: my laptop has the wrong software. So yeah, people always benefit, I feel, so..

Dave: Yeah, definitely. How are you keeping track of your drills and your exercises?

James: So ours is a little bit too ad hoc. Some of the other teams are getting a lot better at it than us, and they’ve come back and started teaching us stuff, which is good. But what we do is we have a Google Doc with the set of 15 drills that are kind of our core. The rule is basically that every team member needs to do the entire set of drills at least once, and after that they should be given each drill randomly, spread out. They should be able to do every one of the drills under various circumstances, and that’s before they’re suitable, or feel confident enough, to go on call. We then try to just keep a log to make sure that individuals have received at least one or two of these drills, maybe every quarter or every couple of weeks. One of the other things that we try to do as a practice is that the offgoing on-call person pages the oncoming on-call person at some point before the handover occurs. That means the oncoming person has confidence that they have the right equipment and so forth. So that’s one of the practices. So it’s kind of the logbook, and then the handover page as well. The other practice is what I’m doing now, which is, to beat the on-call atrophy thing, to do this slightly out of hours. That’s kind of ad hoc at the moment because, as I think I said earlier, we’re building up Concourse pipelines to start triggering these things and automating a lot of this. That’s generally how we do it.
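The rule that every team member completes the whole drill set at least once before going on call can be checked with a very small log. The sketch below is a hypothetical Python illustration; the drill names and function names are made up, and the real team tracks this in a Google Doc rather than in code.

```python
from typing import Iterable, List


def outstanding_drills(drill_book: Iterable[str], completed: Iterable[str]) -> List[str]:
    """Drills from the book that this engineer has not yet completed."""
    done = set(completed)
    return [drill for drill in drill_book if drill not in done]


def ready_for_on_call(drill_book: Iterable[str], completed: Iterable[str]) -> bool:
    """True once every drill in the book has been done at least once."""
    return not outstanding_drills(drill_book, completed)


if __name__ == "__main__":
    book = ["db-logs", "web-memory", "vpn-access"]   # hypothetical drill names
    log = ["db-logs", "db-logs", "vpn-access"]       # repeats are fine
    print(outstanding_drills(book, log))
    print(ready_for_on_call(book, log))
```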

Dave: Other than what you’ve mentioned, are there any other mistakes that people should look out for when they begin trying to do these?

James: This is going to answer a completely different question which you didn’t ask..

Dave: No problem.

James: This is possibly a mistake I made in designing these things. It was actually at the meetup: someone asked a question, and I thought about it, and I actually think it was a really good addition. I do recommend that people use all the tools that they would use when handling a real incident. And someone asked me, “Do you include your SLA in that?” I hadn’t actually thought about it, and then I said, actually, no, but that was dead right. We try not to put someone under pressure for either drills or exercises. We want the experience to be a learning experience. We don’t want to turn it into a race. We don’t want to turn it into a stressful, timed thing, something people don’t want to take part in. So my initial reaction was, oh, if we turn it into an SLA thing, this could turn into a bosses-versus-the-engineers type thing. But then I thought about it a bit more, because you don’t necessarily have to add additional pressure, but you do have to add those best practices. So the SLA thing is: make sure you answer within your 15 minutes, have a response in 30 minutes, early updates and so on. So that’s a pretty useful addition to these sorts of exercises and drills.
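One low-pressure way to fold those SLA practices into a drill is to note afterwards which targets were missed, rather than racing a clock. The sketch below is a hypothetical Python illustration; the 15-minute and 30-minute figures are the ones James mentions, while the function and constant names are assumptions.

```python
from datetime import timedelta
from typing import List

ANSWER_WITHIN = timedelta(minutes=15)    # acknowledge the page
RESPOND_WITHIN = timedelta(minutes=30)   # first response on the incident


def sla_misses(time_to_answer: timedelta, time_to_response: timedelta) -> List[str]:
    """List which SLA practices were missed during a drill, for review afterwards."""
    misses = []
    if time_to_answer > ANSWER_WITHIN:
        misses.append("answer")
    if time_to_response > RESPOND_WITHIN:
        misses.append("response")
    return misses


if __name__ == "__main__":
    print(sla_misses(timedelta(minutes=12), timedelta(minutes=41)))
```

The result is meant as a discussion point for the engineer, in keeping with the aim of a learning experience rather than a race.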

Dave: Yeah. I can’t remember where I heard it, but I think it was a card game, an ops fire drill card game. Basically, you only get one turn at a time, so you can only play one card. And one of the things that comes up, and this is more for a larger company than a startup, is that you have a VP breathing down your neck because you haven’t contacted them. Maybe you should have started to delegate some of the tasks..

James: Yeah, that’s actually a really good point. A lot of these concepts I’ve taken from the Google SRE book, which has a particular chapter about incident commanders, or sorry, incident handling. It talks about this idea of having an incident commander, and it talks about having someone handle communication and so on. So another thing you should do when you’re doing these group exercises is to start delegating these tasks and figuring out who does what, when. A particular real incident I was involved with was quite high profile, and we did a delegation and it worked incredibly well. One of the product managers basically was the incident commander, and another product manager took over communications with outside parties, leadership and so on, and it worked incredibly smoothly. But I think one of the real reasons it worked incredibly smoothly was that we had gone through this almost exact scenario just two weeks earlier for a different customer site. It was just complete serendipity that almost the exact same thing happened, the exact same scenario under a different situation. It was weird, to say the least.

Dave: Oh, yeah. So, on contacting people and managing incident communication: I don’t know how big Pivotal is, but I guarantee it’s bigger than my small startup. What we try to do is spin up a Slack channel specifically for the incident. That way people can pop their heads in and don’t have to keep asking, “What’s the story? What’s the story? What’s the story?”

James: Yeah, there’s a couple of ways to handle it, as I’ve seen. I think it’s important that, as a team, (a) you decide on the single way.

Dave: Yes, as with almost everything.

James: Correct, and (b) that external actors understand the single way in as well. So an example might be each team Slack channel in Pivotal. If you look at the channel description, we include things like an intro and a couple of other pieces of information like that, but if we have something going on, we will have that as bold as we can at the top of the Slack channel. So people know: don’t spam the team channel with this, go to this other Slack channel, or go to the PagerDuty page, or go to this status.io, I can’t remember.

Dave: There’s Statuspage.

James: Status page, yeah.

Dave: Is that an Atlassian product? I think it’s Atlassian.

James: I think it is. I’m not sure.

Dave: They’re everywhere.

James: But yeah, the point is to make sure there’s a kind of unified communication there. You’ve reminded me of something that had completely gone from my mind. Oh, yeah. Sorry. Now I remember it. One of the other things we find useful, or have found useful in other companies, is this concept of a commercial on call. The way that worked was, you had your typical engineering on call, you know, fixing servers at three in the morning. But we also had this concept of a commercial on call. What would happen there is, if an incident was running longer, or if the engineer felt an incident was going to have a serious business impact, they would call the commercial on call, who would then start liaising with the client and would become responsible for funneling information, giving updates and so on. So the concept there is that rather than having the angry VP of your customer screaming at the engineer, you have teamwork between, let’s say, the commercial side of the business and the engineering side of the business. And you would assume, because you’re both teammates, that there will be more empathy there to let the engineer do what they’re good at, and let the commercial person do what they’re good at, which is communicating with the customer. So that was another idea I’ve seen.

Dave: Yeah, that’s good. I mean, you know, it seems like all we’ve talked about here is just removing the stress from the person trying to make things right.

James: Oh yeah, absolutely. It’s only a pretty small sample size, but when you see people coming in for interviews, or applying for jobs, or joining a new company, when they hear the words “on call” it is an immediate stressor. Very often it’s a situation where people will say, “No, I don’t want this job,” or it becomes a deal breaker for people. That’s why I feel it’s so important, and why I’m quite passionate about it, to make it a role that becomes exciting rather than something that’s going to burn you out. We talked briefly about the Google SRE book; one of the heads of Google SRE has this quote from one of his children, and it was something along the lines of, “Don’t panic when the crisis is happening, for you won’t enjoy it.” It’s like what we talked about earlier on with these text-based adventure games: people like solving problems. They don’t like being under stress. So if you’re able to get in and solve a problem really quickly, and you’ve done this sort of stuff before, it can actually become a pretty good buzz.

Dave: Yeah, absolutely. You know, growth happens when you’re just outside your sphere of comfort. And if you have to wade through the muck to get to that edge, you’re already so worn down that it’s really hard to enjoy that.

James: Yeah. And, you know, to go back to the exercises, take the example I gave, clock drift: when you see it, it’s kind of a one-command fix. The nature of the problem doesn’t necessarily have to be clever or difficult or whatever. But the thing is, there’s no such thing as a hard question if you know the answer. So what might seem straightforward to one person could be incredibly difficult to another. But again, (a) by doing the bigger-scale exercises, you will find this out, and (b) by doing the Dungeons and Dragons exercises as a group, you get to discuss these things. And you get to understand why one of the engineers thinks doing an strace is the obvious thing, whereas another of your engineers will just read the source code. You know, it’s very different.

Dave: Yeah, and one thing I would add to that is our plan (I keep saying plan since we haven’t actually implemented it yet), which would be to identify steps we could take to prevent that issue from ever happening. It doesn’t mean you can actually implement it now, but at least you’ve identified something that you could possibly do in the future to make it a non-issue.

James: Yeah, it’s interesting. I thought about this; I think it was actually you who asked this question. So, RCAs are really important for a whole heap of reasons. Sorry, root cause analyses are really important for looking at what caused the problem. And again, the SRE book talks about conducting post-mortems and so forth. I think it is important, especially if it was a process failure; it’s very important to identify the process failure and try to fix the process. And I certainly encourage doing RCAs. But you also have to live in the real world, whereby, to be frank, 100% uptime for everything, all the time, over a long enough period, is just not actually possible. So rather than building up a whole heap of RCAs for incidents that will probably never happen again, or issuing playbooks based on RCAs for incidents that will probably never happen again, what you need to do is build up the holistic skills to solve the generic problem. Not to go too far afield, but a good example of this is monitoring. You have some sort of wacky thing where a log file overruns, or something is misconfigured and creates a million zero-byte files or whatever it is, and you run out of inodes. And you often get this thing of, “Oh, why didn’t we monitor that?” And it’s immediately, okay, we have to set up monitoring for this one very weird edge case (by the way, I know that one is not really an edge case). And you may then have a playbook that says, before we do anything, always check, you know, the number of inodes. What happens there is people end up, as I call it, cargo culting what they’re doing. So you have these 15 steps for checking exotic parts of your system that nobody really understands, but they came out of an RCA, were put into a playbook, and are now part of some sort of checklist. But it gives your engineers no holistic knowledge of how to actually troubleshoot. So sometimes these types of things are useful to a small degree, but for me, the holistic steps of building muscle memory, and then building up ways to apply it, are far more valuable.
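For readers who have not hit inode exhaustion before, the following is a minimal Python sketch of checking inode usage by hand, the sort of one-step lookup a drill might practice. It assumes a Unix-like system, since `os.statvfs` is not available on Windows; the `inode_usage` helper is a name invented for this example. The same numbers are what `df -i` prints.

```python
import os


def inode_usage(path="/"):
    """Return the fraction of inodes in use on the filesystem holding `path`.

    Returns None when the filesystem does not report a fixed inode count.
    """
    stats = os.statvfs(path)
    if stats.f_files == 0:
        # Some filesystems allocate inodes dynamically and report zero here.
        return None
    return (stats.f_files - stats.f_ffree) / stats.f_files


usage = inode_usage("/")
print("inode usage:", "n/a" if usage is None else f"{usage:.1%}")
```

Knowing where the number comes from is the point here; whether it belongs in a permanent checklist is the judgment call James is describing.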

Dave: Oh, I absolutely agree. But I also think there’s some value in pre-mortems. That’s what I’m considering here: by doing the Dungeons and Dragons exercises, especially if it’s something that actually has happened, like the example I gave of, we ship it to test, everything’s fine, so now we tell our boss and ship it to prod, and why is it failing? Okay, so we know we need to check the secrets and the config maps. That could be an actual gate to prevent the delivery to production. So that’s the sort of thing, and not those weird edge cases.
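A gate like the one Dave describes could be sketched roughly as follows in Python. Everything here is hypothetical: the key names, the `release_gate` function, and the dictionaries standing in for production config maps and secrets are invented for illustration, and fetching the real values from a cluster is deliberately left out.

```python
# Hypothetical required keys; a real list would come from the application itself.
REQUIRED_CONFIG = {"API_URL", "LOG_LEVEL"}
REQUIRED_SECRETS = {"DB_PASSWORD"}


def missing_keys(required, deployed):
    """Return the required keys that are absent or empty in `deployed`, sorted."""
    return sorted(key for key in required if not deployed.get(key))


def release_gate(config, secrets):
    """Return a list of problems; an empty list means the gate passes."""
    problems = [f"config map missing {key}" for key in missing_keys(REQUIRED_CONFIG, config)]
    problems += [f"secret missing {key}" for key in missing_keys(REQUIRED_SECRETS, secrets)]
    return problems


# Made-up production values: the config is complete but the secret was never created.
prod_config = {"API_URL": "https://example.invalid", "LOG_LEVEL": "info"}
prod_secrets = {}

problems = release_gate(prod_config, prod_secrets)
for problem in problems:
    print("BLOCKED:", problem)
```

A delivery pipeline could refuse to ship while the list is non-empty, which turns a lesson from an exercise into a check that runs on every release.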

James: Yeah, that is a good example of a gate. A bad example of a gate is, you know, an obscure memory leak. I still remember, during the Sydney Olympics, the Australian government, for television reasons or whatever, decided to shift the date on which daylight savings changed.

Dave: Oh my gosh. That’s gonna mess everything up.

James: Yes, there’s a lot of things.

Dave: And I’ve heard the European Union is considering not doing daylight savings time. As a person, that’s fantastic. As a computer person, that’s scary.

James: Well, all I’ve got to say is everything should be in UTC all the time anyway. But yeah, the upshot of this was, I think Sun, who still owned it at the time, had to ship emergency patches to the JDKs, and we had one customer based in Australia who didn’t patch their JDK, and wacky things happened.

Dave: Of course it did, my goodness

James: So, yep, that’s something that (a) probably couldn’t be practiced for, but (b) some of these troubleshooting steps might help you figure it out. It’s something where, if you put it into a playbook or a monitor or whatever, it doesn’t make any sense, because daylight savings doesn’t suddenly get shifted on a regular basis.
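To make the “everything should be in UTC” aside concrete, here is a small Python sketch. It uses two fixed offsets, +10:00 and +11:00, purely as stand-ins for a standard-time rule and a daylight-time rule; the date is arbitrary, and real code would rely on an up-to-date time zone database rather than hard-coded offsets.

```python
from datetime import datetime, timedelta, timezone

# Illustrative fixed offsets standing in for "standard" and "daylight" rules.
STANDARD = timezone(timedelta(hours=10))
DAYLIGHT = timezone(timedelta(hours=11))

# Store the event once, in UTC. The date is arbitrary.
event_utc = datetime(2000, 9, 1, 2, 30, tzinfo=timezone.utc)

# The local wall-clock time depends entirely on which rule the software believes.
local_if_standard = event_utc.astimezone(STANDARD)
local_if_daylight = event_utc.astimezone(DAYLIGHT)

print(local_if_standard.isoformat())
print(local_if_daylight.isoformat())

# Both values still refer to the same instant, so comparisons stay correct.
print(local_if_standard == local_if_daylight)
```

A system holding out-of-date daylight savings rules displays the wrong wall-clock hour, but timestamps stored in UTC still compare and sort correctly.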

Dave: So, is there anything that we haven’t talked about?

James: No, I think there’s a lot of things. I have a tendency to talk forever, but unfortunately I can’t this evening.

Dave: I think we could probably do a whole other episode on monitoring.

James: Oh, don’t get me started on monitoring. I can tell you I can start. Yeah. I could certainly talk a lot about monitoring.

Dave: I’d enjoy that. But so what’s next within this fire drill realm?

James: I have a blog post that I’m hoping will come out soon, which covers a lot of these topics and possibly puts a bit more structure on them. So that’s the main thing for me at the moment. Within my organization, we evangelize doing these sorts of things. We’ve also started trying to include these practices when we’re onboarding new customers. That’s had interesting effects, and the feedback so far has been quite positive.

Dave: Yeah, if you get that done soon, I’ll attach it to the show notes.

James: Okay.

Dave: I can always add it later.

James: Okay, cheers.

Dave: So what would you ask our technical listeners to do in relation to what we’ve talked about? So the CTOs, the ops types.

James: So, with seasoned engineers there’s often pushback; we have discussed this a lot, so I won’t go into it. But I do think they should try to champion these ideas. Very often a way to champion this is to get a senior engineer to put together the initial drill book. I think the benefits for someone who’s just joined the team, someone who’s just been onboarded, are pretty self-evident. I think the Dungeons and Dragons exercise, again, is incredibly interesting for new members. For the team, it’s actually a great team-building thing. You know, maybe one Friday a quarter we do team building: get some beers in and set aside an hour, and a lot of people have some fun while they’re learning new things. That’s, in a way, the technical value of a lot of this. I would then frame the importance of this for the people who sign the checks. I’ve already discussed things like mean time to recovery, and the ability to have an engineer get their hands on the problem as soon as possible is hugely valuable. Doing these sorts of things will hopefully also encourage, or give confidence to, more people to go on call. If you have more people going on call, your support roster gets bigger, and you don’t have people suffering burnout from being continuously on call for long periods of time, which in turn cuts down employee churn. There’s a huge human benefit to this. A colleague of mine had a great talk about building HA into humans, and a lot of what she talks about is people being burnt out. I know of a colleague who was not only on call for 30 days straight, but was actually called out for 30 days straight. And they more or less rage quit the company, and having to replace an engineer is an expensive thing. So doing these things to get more people on call stops this sort of thing happening, and apart from that, it’s a good human thing to do.

Dave: Yeah, that’s one thing definitely to consider as co-founders, or anyone pushing a product: you’re also concerned with how stable your product is for your users or customers. So it’s not wasted time to make sure that you can support issues as they come up. It can actually be more important than pushing new features sometimes.

James: You know, I already used the expression “HA for humans”. Oracle is my go-to example of software for everything. You go off and spend millions of dollars, millions of euro, on HA cluster licenses for Oracle and so on. So is spending 15 minutes a week on these sorts of exercises, which will allow someone to get in and fix this Oracle cluster, an expensive thing? My answer is no. It’s a huge amount of value.

Dave: Yeah. Great. So how can people reach you?

James: Probably just drop me an email. My email address is...

Dave: Sure. Yeah, I’ll stick that in the show notes.

James: Okay, cool.

Dave: All right. Well, thanks very much. It’s a pleasure.

James: Okay.

Dave: Cheers. And thank you everyone for listening.

Until next time, remember any sufficiently advanced technology is indistinguishable from magic.