
Discussing DevOps Metrics for Software Reliability & Quality: Engineering Insights Podcast Ep. 6

Junade Ali
May 19, 2021

Junade Ali discusses DevOps Metrics for Software Reliability & Quality with Julian Colina (CEO of Haystack Analytics) and Kan Yilmaz (CTO of Haystack Analytics).

Transcript

Junade Ali:
Hello, and welcome to another episode of the Engineering Insights podcast, presented by Haystack Analytics. I'm your host Junade Ali. Today's discussion stems from an internal company chat on the best DevOps metrics to measure software quality. I'm joined in this discussion by Julian Colina, the CEO of Haystack Analytics, and Kan Yilmaz, the CTO of Haystack Analytics. We hope you enjoy this unfiltered conversation.

Junade Ali:
Hi, guys. So today I'm joined by Julian and Kan. Kan is the CTO of Haystack Analytics, and Julian is the CEO, and together they're the founders of Haystack. And this podcast really comes out of an internal discussion we were having about both how our customers like to measure quality in engineering, and also what best practice should be. So perhaps that's a really good area to start. To start with Kan, if you could tell us initially a bit about you and introduce yourself, but also tell us a bit about what you are seeing in terms of how our customers really like to measure engineering quality in their organizations.

Kan Yilmaz:
Certainly. To introduce myself, my name is Kan, I'm the CTO of Haystack. Prior to this, I led a team at Cloudflare, and we were quite a fast team. We were constantly iterating, and because the team had this attribute, we were able to produce 10% of the whole company's revenue with only four to five engineers in total. That's 1% of the company's total team size. And it was all thanks to iteration speed. Coming back to Junade's question, which is, what do people look for? I would put this into three different categories.

Kan Yilmaz:
The first category is: I have a team that I'm managing, or multiple teams, I'm an engineering manager. How can I actually move this team forward? Make their bottlenecks disappear, have actionable items, have actionable processes that I can introduce to these teams, and improve their software quality. So that's one section.

Kan Yilmaz:
The second section that I want to mention is more like: throughout the organization, how can I actually ensure that higher quality software is going out? This usually comes from higher-level engineering leaders, such as the head of engineering, and so on. They try to understand the bottlenecks in the whole organization, not just a few teams, and introduce processes which will affect the whole organization itself. So that's another angle they come from.

Kan Yilmaz:
And the third type of customer that we actually see wants to understand things in a more micromanagement sense: how can I be better in one-on-ones? How can I compare engineers, figure out the best engineer and the worst engineer, and move the worst engineer toward the best engineer? They usually have really good intentions, but eventually, they actually want to compare engineers, figure out how they can be more successful in one-on-ones, and try to improve quality through the person rather than the process itself.

Junade Ali:
Awesome. And I guess, in that area there are a few different kinds of solutions we offer. I mean, at Haystack we're not really in the business of helping people micromanage their engineering team, but we certainly help in terms of boosting quality and helping people go on that journey of continuous improvement. So perhaps we could drill in a little bit more on what tooling we offer to help with that, before I pass over to Julian.

Kan Yilmaz:
Sounds great. So I can actually go with the simplest one, which is, for the type of customer whose problem is, how can I make sure I can move from a lower performing engineer to a higher performing engineer? For this kind of use case, Haystack does not help you at all. We directly say to them, "Haystack will not help you. This is an anti-pattern." An engineering organization, an engineering team, by definition, is a team sport. If you do it by yourself, then you're actually ignoring most of the value which is generated. Take quality: everyone has pull request review processes. But if you only focus on yourself, you will definitely make sure other people get blocked on pull request review, and the whole team's productivity will actually go quite low. Their quality will go quite low as well. So comparing engineers by defined metrics is actually an anti-pattern, and Haystack does not give any kind of tooling to support this.

Kan Yilmaz:
Once we move to the other parts, such as, how do we improve the organization? How can we improve the whole organization, find the bottlenecks, and fix them? To move forward with that, Haystack has high level KPIs, we call them North star metrics. There is the speed part and the quality part. Because this conversation is more on quality, I will focus on that part. We provide a quality metric, which is called Change Failure Rate. Change Failure Rate is the number of hot fixes or rollbacks divided by the number of deployments. So what is the percentage of deployments that actually failed, that resulted in an incident? We track this as a high level metric. And by surveys that Google and a few more teams have done, they have figured out that elite teams all the way down to normal teams have a zero to 15% Change Failure Rate, which is the metric that we track. But if you're a low performing team, it is actually 45% plus.

Kan Yilmaz:
So almost one in every two deployments fails. You do a deployment, then you immediately do a fix afterwards. That means you actually have a low quality deployment process. So Haystack measures this and gives you a way to deep dive into how you can actually improve this metric. What are the actionable items that you can take? We provide everything from low level to high level. So you can look at this metric, put that as your KPI, and figure out in your whole process, "Okay, currently our Change Failure Rate is 40%. Our goal for the next quarter is to decrease it to 10%." You make a decision, you have that goal, your team has a discussion. They look into different metrics to figure out, how can we make this metric better? Then they start iterating. They slowly add more processes, more features, and eventually they bring that quality metric into a healthier range that has been proven by quite a few studies. So that's one side.
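
As a rough illustration of the calculation Kan describes, here is a minimal Python sketch (not Haystack's implementation; the deployment records and the `failed` flag are assumptions for the example) that computes Change Failure Rate and compares it against the 0-15% and 45%+ bands mentioned above.

```python
# Minimal sketch of the Change Failure Rate calculation described above.
# The deployment records and their fields are assumptions for illustration,
# not Haystack's actual data model.

deployments = [
    {"id": 1, "failed": False},   # clean deployment
    {"id": 2, "failed": True},    # needed a hot fix or rollback
    {"id": 3, "failed": False},
    {"id": 4, "failed": False},
]

def change_failure_rate(deployments):
    """Hot fixes / rollbacks divided by total deployments, as a percentage."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d["failed"])
    return 100.0 * failures / len(deployments)

cfr = change_failure_rate(deployments)
if cfr <= 15:
    band = "elite-to-normal range (0-15%)"
elif cfr >= 45:
    band = "low performing range (45%+)"
else:
    band = "in between; room to improve"

print(f"Change Failure Rate: {cfr:.1f}% -> {band}")
```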

Kan Yilmaz:
The other side is, how can I make sure my individual team is actually achieving higher quality, releasing code in a much better way? We have a few methods to do this. The best method is, again, the North star metrics. Change Failure Rate, the whole process is the same. Individual teams can look at this, they can see the differences between an iOS app and an Android app and what the difference is between their quality. And then they can figure out what kind of process they can add to this engineering team to actually make it better. But we also have more actionable, directly affecting stats, such as... I would call these best practices in the form of notifications. So we put them in the engineers' Slack channel, or email, depending on what they use. They will get certain best practices. These might be: this person is handling seven different pull requests at the same time. They're multitasking. There's a high likelihood that, first, these features will probably be delayed, and you will miss your deadline. Secondly, because they're multitasking, they might have too much context switching and not enough bandwidth to think critically on each task, and the quality might be low.

Kan Yilmaz:
Another example is if anything is merged without a review, or any commit is made to the master branch without a pull request, so nobody reviewed it, it was hidden. We can discover these and let the team know so they can take the appropriate action. And we have seen these directly improving the North star metrics, for quality specifically.
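
To make those notification examples concrete, here is a small illustrative sketch of how such best practice checks might be expressed. The pull request records, the `wip_limit` threshold, and the field names are assumptions for the example, not Haystack's actual rules engine.

```python
# Illustrative sketch of the best practice notifications described above.
# The pull request structure and thresholds are assumptions for the example.

from collections import Counter

pull_requests = [
    {"id": 101, "author": "alice", "state": "open",   "reviews": 1},
    {"id": 102, "author": "alice", "state": "open",   "reviews": 0},
    {"id": 103, "author": "bob",   "state": "merged", "reviews": 0},  # merged without review
]
direct_commits_to_master = ["9f2c1ab"]  # commits pushed without a pull request

def notifications(pull_requests, direct_commits, wip_limit=7):
    alerts = []
    # Too much concurrent work per author (multitasking risk).
    open_by_author = Counter(pr["author"] for pr in pull_requests if pr["state"] == "open")
    for author, count in open_by_author.items():
        if count >= wip_limit:
            alerts.append(f"{author} has {count} pull requests in flight; delivery and quality risk")
    # Merged without any review.
    for pr in pull_requests:
        if pr["state"] == "merged" and pr["reviews"] == 0:
            alerts.append(f"PR #{pr['id']} was merged without a review")
    # Commits that bypassed the pull request process entirely.
    for sha in direct_commits:
        alerts.append(f"Commit {sha} landed on master without a pull request")
    return alerts

for alert in notifications(pull_requests, direct_commits_to_master):
    print(alert)  # in practice this would go to Slack or email
```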

Junade Ali:
Awesome. So I guess to really summarize those... On one side, we've got the four key metrics. These are metrics that were taken from the research done in the State of DevOps reports and the Accelerate book, highly credible, highly scientific research. And with these four key metrics, companies which were strong performers were two times more likely to succeed in both their non-commercial and commercial goals. So everything from customer satisfaction through to market share. But additionally, over a three year period these companies had a 50% higher market cap. And two of these metrics really are in that scope of quality. So, Change Failure Rate, you spoke about that. And then there's also Mean Time to Recovery as giving that quality dimension. But that doesn't tell the full story.

Junade Ali:
So, for risk notifications in advance: at Haystack, one of the features we offer is really these risk notifications. If it seems like the team is going to burn out because throughput's abnormally high. Or if there's stuff which has been merged without review, or merged without pull requests, or people who are juggling too many tasks because there's too much concurrent work. We can notify and provide alerts on those things, really in advance of these things materializing. What I kind of wanted to focus on next though, Julian: what are the trends you're seeing in terms of what customers are really asking for? In terms of taking their reliability practices, their software engineering quality practices, to the next level. What are the common lines of discussion you're seeing?

Julian Colina:
I'm on the phone pretty much all day every day at this point, talking about metrics. How to evaluate this concept of speed versus quality on an engineering team, and really bring forward these concepts that we're talking about. How can we deliver value to customers faster and more reliably? And the research actually shows that we can do both of those at the same time. By improving our methods of quality, such as unit tests, automation, CI/CD, we can actually improve not only the quality of the code that we're releasing, but also how quickly we can release it. So a lot of the conversations actually revolve around that. These North star metrics allow us to hit that dream state, where we can quickly release code that's working and give developers the tooling that they need to actually get that feedback on quality really quickly.

Julian Colina:
So those two things go hand in hand. So a lot of the conversations are about, how can we do this from a capability standpoint? And oftentimes the sales call ends up being this coaching session where we help you figure out what you can put into place to help with this stuff. But ultimately, where Haystack ends up standing is, at the end of all these capabilities that we're putting into place, these investments that we're making, how can we reliably measure what that impact is going to be? And that's where these North star metrics come into play. It's really monitoring that speed and that quality. So Change Failure Rate, how often does a deployment require a hot fix or a rollback, as well as MTTR.

Julian Colina:
So how quickly can we actually respond to issues in production and the issues that stem from them? So it's stability and quality in that sense. But when customers come to us, I'm seeing a larger and larger pattern of people looking and trying to figure out what quality means for their team. So there are quite a few people who are looking at unit test coverage, build fails, tickets in Jira that are labeled bug, and things like this. And I just wanted to set the tone, because when it comes to the North star metrics, it's really all about Change Failure Rate and Mean Time to Recovery. These are metrics that have been proven across, quite literally, thousands of teams, showing that if you get really good at these metrics, then your organization ends up being more profitable, you have higher market share, you have increased customer satisfaction. And that's because we have this North star metric that we can all align on and ensure that we're able to handle that quality.

Julian Colina:
Now let's think about some of the metrics that people often come in with, like attempting to measure unit test coverage. I can, in like 10 lines of code, create unit test suites that give me 100% code coverage. Does that mean ultimately that the quality is great? No. But let's say we implemented a practice where we do measure unit test coverage. It's sort of a leading metric, a sub metric underneath these North star metrics. And we're actually looking at how that impacts the North star metrics themselves, rather than just simply looking at something like unit test coverage, or number of bugs, or build fails, which is sort of a subset of overall quality. And I think that's where a lot of the conversations stem from.

Julian Colina:
Where does something like unit test coverage, which, you know, some teams feel very strongly is a quality metric, fit into the Haystack metrics as a whole? Ultimately we help you get that North star to make sure that you're really improving on speed and quality. And these sub metrics, or leading metrics, or whatever you want to call them, end up boiling back up to these more important overarching metrics.

Junade Ali:
Yeah, and I think that's something which is really important to bear in mind. You can sometimes get these teams which over optimize on very, very local areas: unit test coverage, certain types of linting rules, things like this. These could yield some improvement in a local area, but they could have no impact on, or even be harmful to, the overall health of the system. So it's really important, I guess, to look at these North star metrics as a whole. So with those two metrics we've got there, Change Failure Rate and Mean Time to Recovery, I guess we can start with Change Failure Rate. And the definition, I guess, of that is the likelihood of a given change going wrong. So, how many failed deployments do we have, over the total number of deployments? If they failed, it may be a rollback, it may be a hot fix. It could be anything like that. What do teams really expect to be able to see from that metric? Kan, is that something you'd be able to comment a bit on?

Kan Yilmaz:
Sure. So let's think of it from this perspective. If you try to aim for perfection, well, what do we expect? We expect absolutely no hot fixes, no incidents at all. But if you're moving forward, you cannot avoid mistakes, you will 100% have some kind of incident. The important part is, how can we deliver these features fast, and also recover from failures really fast? The reason why we have Change Failure Rate and Mean Time To Recovery together is that one of them looks into how often you actually fail deployments. The other one looks into, if we do fail, let's immediately fix it. We lost a few seconds of downtime, maybe zero seconds, depending on what kind of system you have, and then we can continue to serve users. You will 100% make mistakes if you are moving forward, if you are experimenting, if you are trying to grow your company. There is no way to avoid this. So what is the goal for Change Failure Rate, can we aim for 0%?

Kan Yilmaz:
Depending on the timeframe you choose, probably you can aim for 0% for a month, maybe three months. But in a year, I would say, if you're moving fast enough, you will have some kind of incident. And speed is as important as the quality. If you're not moving fast enough, you'll be behind the market, so you need to release fast. Coming back: what should we aim for? It's not going to be 0%, it's going to be something higher. We already talked about what the general literature or the research says on this: a zero to 15% rate is an ideal number.

Kan Yilmaz:
If around maybe one in ten deployments actually results in some kind of incident, that means we're moving fast enough, and we have some kind of buffer that allows us to move forward. Let's try to connect that to Julian's example about aiming for 100% code coverage. Getting to 60% code coverage will take you some time. 80% will take significantly more. But going from 80 to 100% will take so much time, so much investment, and the benefit you get from it is quite marginal.

Kan Yilmaz:
The same thing happens with Change Failure Rate. So it's better to put it in a certain range, and zero to 15% is that range. Aiming for a flat 0% is a really, really high bar, and the value that you will get for your investment in all the quality practices you might aim for will actually be marginal. So the aim shouldn't be 0%, 1%, 2%. It's good to be in this range; as long as you know that you're in this range, you're good.

Kan Yilmaz:
Coming back to leading metrics: how can we ensure that our Change Failure Rate, or the number of incidents and hot fixes that we do, is not too high? There are quite a few different methods we can use. One of them, the simplest one, is the code review process. It allows engineers to catch different kinds of bugs which might break production. The same thing with tests, tests are there for a reason. Do we know what we should aim for? To be honest, no. If we look through our different projects, for some projects 30% test coverage is actually quite good for them to achieve a Change Failure Rate which is in the healthy margin. But if we go to something like, let's say, Linux [Lite 00:20:09] Kernel systems, or space rockets, okay, now it's slightly different. We need to increase our test coverage, we need to make sure that we don't break anything, as the consequences are significantly higher.

Kan Yilmaz:
So it depends on the project that the team is working on. It can be anything from 0% to 100%. At 0%, we have seen different teams which achieved Change Failure Rates of zero to 15% without test coverage, because they have different processes to handle that. You don't even have to write tests in certain circumstances. It's up to your goal.

Kan Yilmaz:
If you have a goal to improve it, and you check, okay, tests might help us, go with test coverage. Set the limit at 30%, look what happens. Increase it up to 50%, look what happens. 70, look what happens. Don't just read some random blog post that says 80% test coverage is the best and go for it. You can do 80% test coverage, but maybe half of that test coverage is actually really marginal, and doesn't improve the code quality, or the North star metric that we're actually aiming for. So it's really hard to understand this number. If you only look at the leading metric, 80% looks good. But if you look at the Change Failure Rate, you immediately understand, yes, we actually made efforts which were not that efficient, and we could have spent those resources somewhere else.

Kan Yilmaz:
So I talked about the pull request review cycle. I talked about test coverage. There is definitely the CI system itself: it can catch linting. Not so much formatters like Prettier, but linters can catch certain issues, which is a really basic thing that people can add. And coming back to deployment: if you do manual deployment, some person might just miss a step in the README file. Okay, now you have a hot fix that you need to take care of. So automation is something that will help with this.

Kan Yilmaz:
There are hundreds of different processes that you can adopt. There is no right answer on which process is the correct one, but you just figure out: what is the lowest hanging fruit? What can we do right now that the whole team is mostly complaining about, that is the simplest and fastest to do? And then we can improve that number into a healthy margin. That is the process that I see in healthy teams, teams which have a really high quality product and are also delivering really fast.

Kan Yilmaz:
So that's the habit that I have seen quite often, and I would recommend tackling things one by one, going for the lowest hanging fruit and, depending on what stage your company is at, aiming for that number to be in the correct healthy area. Rather than randomly trying out different processes and being less effective in the whole process.

Junade Ali:
Yeah, I find this whole area of balancing these competing interests of risk and reward really quite fascinating. And part of this is probably because, in a prior life, I used to be a road traffic software engineer. I was looking, a little while ago, at a project called Tokeneer, by Altran. Basically, the NSA went to them and said, can you build this high reliability biometric system? And they went through a very, very rigorous process for how they would capture the requirements: the formal specification process in the Z language, then they would write the software in the SPARK Ada programming language. They would do things like formally verifying the software using contracts.

Junade Ali:
This level of assurance goes well beyond what you'd ordinarily see even in a lot of high reliability areas, and certainly beyond the scope of the code we would ordinarily write for different types of tools or web apps, because the reliability requirements are so much higher; software for aviation and things like that is usually written in languages like SPARK. But even with all those practices, when Altran open sourced this codebase and asked academics to look for bugs, the academics were still able to find minor defects in the software. Just because of where we're at with computer science, it basically doesn't allow us to make completely reliable software. Even with the very best practices, even with the most skilled software engineers, there will always be these types of issues. And of course, you have to balance the effort you put in against the risk the software has. So it's balancing these competing interests of risk and reward.

Junade Ali:
And I think this ties in with this idea of Mean Time To Recovery. Often what customers care about most is, how long does it actually take to get things back into a usable state after something has happened? Because it's impossible to foresee every eventuality, so this ties us back to this MTTR idea. Julian, I was wondering if you could kind of close the loop and tell us a bit about why, from your perspective, customers seem to care so much about this concept of MTTR? And what this metric really encapsulates.

Julian Colina:
Yeah, I think Elon Musk says it pretty well: if you're not failing, you're not innovating fast enough. So like Kan mentioned, it's nearly impossible to get 100% reliability. If you want to do that, then you never launch any features. So you can get 100% stability, and that's why you have feature freezes. When everybody goes away for Christmas, it's like, "Software team, just stop pushing to production." And then you get 100% reliability. But ultimately, that's not what we want. So when it comes to MTTR, it's a different view from something like uptime. A lot of teams focus on these three nines, these four nines, five nines if you're crazy. But at the end of the day, is that the most healthy way to look at quality? Because somebody who's incentivized to have three nines, four nines, is going to basically be the gatekeeper to production deployments. It's in their best interest to never deploy, to never innovate.

Julian Colina:
Now, if you look at a team who's focused on MTTR, which is responsiveness: when something goes down, or you have an issue in production, we can quickly fix it. What that enables is alignment from both sides, from product engineering and also the infrastructure, DevOps, and platform side, to work together. To actually launch as quickly as we can, deploy quickly, but reliably. So if something goes down, we focus on how quickly we can get that thing back up. And that produces a healthy view on quality, even more so than the standard metrics that we see today, like uptime, for example, which misaligns team priorities.

Junade Ali:
For sure, and I think as well, one of the big components of MTTR is that you've got the detection process as part of that, and then you've got the restoration component. And a big area which is often unseen by engineering teams is that they first need to get that detection phase right, whether it's testing before deployments, or actually doing end-to-end testing in production, or having a robust customer support system to create that feedback loop. So I guess from that perspective, it really, really tests the integrity of an engineering team, in terms of how well they interface with customers, and how well they're able to measure and detect these types of things.

Julian Colina:
Yeah, definitely. You mentioned, like, 20 of them, I'm trying to remember what they are. But like alerting for when something goes down, CI/CD to detect build fails, everything that comes before that failure or that deployment. If we're looking at Change Failure Rate and MTTR, both of these quality metrics, then we will inherently build tooling and processes to support improving on those metrics. And ultimately, what that leads to is a much healthier engineering flow as a whole. So we'll all of a sudden start investing in these developer tools that truly help the developers do their job better. [inaudible 00:28:51] deployments to production, the unit tests, the automated test suites running when you push to staging, and creating an environment in which they can test with reliable data. All these things end up getting pushed to the side when you don't have these higher level metrics.

Julian Colina:
So let's say you're a team that focuses on uptime, at what point are you going to start building this tooling? Are you going to wait for 80% of the engineers to basically complain about it? And it's going to be sort of this fluffy claim that, "Oh, we need more tooling to get these things out faster." A lot of people who have come to Haystack are in this position where they want to invest in the engineers and tooling, and try to make value creation a lot smoother and faster. But without these high level North star metrics, it's really difficult, because it's this big intangible thing where it's really hard to determine the ROI, which ends up being tremendous.

Junade Ali:
I could probably talk all day about this because I like it. For a team of 30 engineers, small tweaks like limiting work in progress, or actually having a deployment button that a developer can push, equate to much faster delivery. And at the end of the day, we're seeing millions of dollars, just for a team of a size of 30, in terms of the value creation that they're able to deliver with these small tweaks.

Kan Yilmaz:
I'd also want to add a few things there. So I had a conversation with a customer who came to Haystack, and they were actually trying to do the Netflix Chaos Monkey. They are around 15 to 20 engineers. Just to recap what Chaos Monkey is and what Netflix does: they have some kind of automated system which randomly tries to break certain parts of their system, so they can see how fast the whole system can recover. So this is the MTTR. They're trying to optimize MTTR at such a high level that they have automated tools to destroy hosts and break production in certain ways, and then they can fix it really, really fast. This is like next, next level. And Netflix is global. They're in more than 100 countries, as far as I know, and you're a 15 engineer team. If you try to implement a system like this, you're probably not focusing on something which might be more effective. Probably 99.9999% availability for your company is not as important as it is for Netflix, where they have millions of customers; the reward function is significantly different.

Kan Yilmaz:
I also want to add that MTTR, as we use it here, is a high level metric in the sense of deployments, or incidents that happen from deployments. But this concept is really important. If you look at it from the positive side, you get a number, but you don't have that much incentive to optimize it: okay, our availability is 99.99, that gives you a number. Or let's change it into another number: how long have we gone without an incident? We didn't have any incidents for the past 23 days, or 42 hours, whatever, you get this number. But if you flip it into the negative: how fast can we recover? Two seconds? Or is it one hour? Or is it one day? Suddenly, it becomes significantly more actionable, and you can actually aim for a proper goal.

Kan Yilmaz:
So between the positive number and the negative number in this kind of metric tracking, I would recommend the negative number, and trying to decrease the amount of negative there is. In this case, recovery time: try to decrease that number so the negativity is not high. If instead I track how long we have been available since the last deployment, that number will not be sustained, you will eventually fail. The important part is not never failing, it is how fast you can recover afterwards.
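
As a simple illustration of the "negative number" view Kan describes, here is a sketch that computes Mean Time To Recovery from hypothetical incident timestamps, rather than tracking how long the system has stayed up.

```python
# Sketch of the "negative number" view: measure how fast you recover,
# not how long you have gone without an incident.
# The incident records are made up for illustration.

from datetime import datetime, timedelta

incidents = [
    {"started": datetime(2021, 5, 1, 10, 0),  "resolved": datetime(2021, 5, 1, 10, 12)},
    {"started": datetime(2021, 5, 9, 23, 30), "resolved": datetime(2021, 5, 10, 0, 45)},
]

def mean_time_to_recovery(incidents):
    """Average time between an incident starting and being resolved."""
    if not incidents:
        return timedelta(0)
    total = sum((i["resolved"] - i["started"] for i in incidents), timedelta(0))
    return total / len(incidents)

print(f"MTTR: {mean_time_to_recovery(incidents)}")  # 0:43:30 for these example incidents
```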

Kan Yilmaz:
I also want to add one more layer on top of this. We are talking about North star metrics, that's great, MTTR and Change Failure Rate. But once we go to leading metrics, you can actually use the same method. Almost all engineering teams have trunk-based development nowadays, at least elite teams have it. Trunk-based development is a single branch where everything gets merged, and that branch is then forked into release branches. So coming back to the trunk methodology: if the trunk fails, then any engineer who's working on it, when they try to create a new branch and implement a new feature and so on, will pull a buggy version of the system. So what will happen? They'll spend a lot of time figuring out, "What the heck is going on? Why is this not working?" And they won't actually be working on the task that they should have focused on, they will be trying to fix what's going on there. And this is not a single engineer, a lot of engineers are working on this trunk. If you're a 200 engineer team, at that moment maybe five engineers are trying to figure out what's going on with the trunk, and there might be an issue there.

Kan Yilmaz:
So you can actually use MTTR on the trunk branch as well. This is a leading metric. It allows you to unblock your team. This is not necessarily a leading metric for quality, depending on how you use it and how you do releases. But assuming it's about speed: if the trunk is down, you cannot iterate fast. But if you can recover the trunk really fast, if you have tooling to figure that out, then you can actually use that metric as, okay, our deployment pipeline or our production pipeline is basically blocked, we need to fix it. And it takes around, let's say, 15 minutes to fix it, 12 minutes, two minutes. If you have proper CI/CD, if you have proper automation, you can decrease it to below one minute, but eventually the pipeline will fail at some point as you keep iterating, and that number will go back up again. But you can use the same metric in different places. It's not just for incidents. You can use it on any kind of flow where a lot of people are dependent on it.
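
Applying the same idea to the trunk, a sketch like the following could measure how long the trunk stays red between a failing CI build and the next green one; the build records are made up for the example.

```python
# Sketch of using the MTTR idea on the trunk branch: for each red build,
# how long until the trunk went green again? Build records are illustrative.

from datetime import datetime, timedelta

trunk_builds = [  # CI results on the trunk, in chronological order
    {"finished": datetime(2021, 5, 3, 9, 0),   "passed": True},
    {"finished": datetime(2021, 5, 3, 11, 0),  "passed": False},  # trunk goes red
    {"finished": datetime(2021, 5, 3, 11, 14), "passed": True},   # back to green
]

def trunk_recovery_times(builds):
    """Durations from each first failing build to the next passing build."""
    recoveries, red_since = [], None
    for build in builds:
        if not build["passed"] and red_since is None:
            red_since = build["finished"]
        elif build["passed"] and red_since is not None:
            recoveries.append(build["finished"] - red_since)
            red_since = None
    return recoveries

times = trunk_recovery_times(trunk_builds)
average = sum(times, timedelta(0)) / len(times) if times else timedelta(0)
print(f"Average trunk recovery time: {average}")  # 0:14:00 in this example
```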

Junade Ali:
That makes sense for sure. And I think one of the areas which would be useful for us to drill into a bit more is something you, Kan, really introduced this idea to me on. It's something I'd been looking at before, but you've also drilled into it as part of the work you're doing on Haystack. And this idea is how you track less severe incidents: specifically things like the number of bugs, or things which are reported, and the dimensions you track on those. So how you can make sure that your customers are satisfied with how you resolve less severe incidents.

Junade Ali:
The reason I find this particularly interesting is one of the things I learned when I was analyzing customer support data, back when I was in the support operations world. One of the things which is really, really critical for customer support is the bug resolution time. It's basically inversely correlated to customer satisfaction. I think the exact metric is the median full resolution time for the customer support inquiry; that's inversely correlated to customer satisfaction. The quicker you're able to resolve these customer issues, the happier the customers are. That holds up to a limiting factor; beyond a certain point, it becomes more important to focus on other areas, like how you interact with the customer, the standard customer support stuff, and things like that. But what I found particularly interesting is that the major limiting factor is often actually how long it takes to resolve something.

Junade Ali:
So I'm curious if you could speak a little bit about the work you've done on the engineering side, as to how you take that finding and implement it for the actual product engineers who have to fix these bugs.

Kan Yilmaz:
Certainly. So this actually ties back to what people come to us for. Julian mentioned that one thing they do is track the number of bugs as a quality metric. So is the number of bugs a good quality metric? The answer is, it depends. So let me go into how we should actually look at this. If you allow any engineer to tag anything as a bug, then what will happen is they will tag anything that is not expected in their normal flow. Engineers have production releases, but they also have development environments. So let's say they're working on a pull request, and there is a bug in the pull request before it's merged into the trunk, or before it's released. They tag another pull request as a bug and fix it. So is that actually a real bug? It wasn't in production. Can you count that as your quality being low or high? How do you judge it? And if you try to track the number of bugs from commit titles, what will happen? A normal engineer, while they're developing a feature, will add something like "fix bug" probably 20% or 30% of the time. That doesn't mean that you're actually fixing bugs, it means that you're just delivering a feature where the engineer tagged something as a bug.

Kan Yilmaz:
How can we make sure that this metric is high signal? It can only be done by the customers. The customers know what actually matters. If you have, let's say, a system, and a certain part of it nobody uses, it's basically not generating revenue for the company, not generating value for the customers. Even if there's a bug there, the customers will not report it, because they're not using it. So let's say you don't need to fix that; you could actually completely get rid of that whole system and you'd still be okay. That is not a signal. But the customers will make sure that if they report a bug, they care. They care, not just like, "Oh, this doesn't work, close." They care enough that they will submit a bug to you. That's the highest signal that you can get.

Kan Yilmaz:
And that is the best method that I have seen for how to track bugs. So you need to get that number from the customers. You should not develop it internally, by a product manager, by an engineer, or by any kind of automated tool which scrapes and does sentiment analysis or any kind of NLP and so on. Those don't work. Customers, that works.

Kan Yilmaz:
But how can we make this actionable, if we go one step further? Let's say you have the number of bugs, you track it from the customers, you track the number of bugs submitted by customers. You have a number, you have 100. So is 100 good? Is 100 bad? What do I do? You can't answer these questions, you're still missing information.

Kan Yilmaz:
Okay, let's try to fix this. There are two dimensions that actually allow you to make this more actionable. One of them is priority. Let's say we have 100 bugs, but two of them are high priority and the other 98 are low priority. You probably will not even fix the low priority ones, because they're just not that valuable to the customers. They did submit a ticket, but it's not that valuable, it's tagged as quite low. But those two big ones, that's really important. So you can have this prioritization as a dimension, to make it more actionable.

Kan Yilmaz:
And the next dimension is teams. If you tag it by teams, so front end team versus back end team, or mobile team and so on, then you can actually see, okay, the back end team has three high priority bugs and 25 low priority bugs. But then you look at the front end team, and they have 15 high priority bugs and two low priority bugs. By total number the back end team looks significantly worse, but in reality, the front end team actually needs to improve their quality processes, because they have 15 high priority bugs. So this allows you to answer: how do I allocate resources? Which team actually needs help? Which part of the codebase is actually higher quality? Which process do I need to improve? You can get all this actionable data, but you need to make sure that the customers submit the bug tickets, and you track them with priority and team as dimensions.
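
Here is a minimal sketch of that two dimensional view, assuming bug records that already carry a team and a priority tag from the support process; the records themselves are illustrative.

```python
# Sketch of counting customer-reported bugs by team and priority.
# The bug records are assumptions for the example.

from collections import Counter

bugs = [
    {"team": "backend",  "priority": "high"},
    {"team": "backend",  "priority": "low"},
    {"team": "frontend", "priority": "high"},
    {"team": "frontend", "priority": "high"},
]

counts = Counter((bug["team"], bug["priority"]) for bug in bugs)

for (team, priority), count in sorted(counts.items()):
    print(f"{team:8s} {priority:5s} {count}")
# A team with many high priority bugs needs process help first,
# even if another team has a larger total count.
```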

Kan Yilmaz:
One thing to note is I have seen a few companies who tried to do this, but they failed to link the support tickets to a Jira ticket. So there were some support tickets which were not tagged, and suddenly the metrics are flawed. You do get some insights, but maybe in that case the support team had not tagged 80% of the high priority bugs, and suddenly you're not tracking properly. You can't make correct decisions based on incorrect data. The accuracy level needs to be high. So I would also encourage people to have 100% coverage, linking each support ticket to a Jira ticket or task management issue. To recap: the customers submit the support ticket. The support team tags every single ticket to an issue, and tags them by priority and team. Then the developers go and focus on the biggest ones, and they fix them.
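
A sketch of the coverage check Kan recommends might look like this, assuming support tickets carry an optional reference to a linked issue; the field names are hypothetical.

```python
# Sketch of checking that every customer support ticket is linked to an
# issue in the task tracker. Ticket records are assumptions for the example.

support_tickets = [
    {"id": "SUP-1", "linked_issue": "PROJ-101"},
    {"id": "SUP-2", "linked_issue": None},        # not linked: the metrics become flawed
    {"id": "SUP-3", "linked_issue": "PROJ-102"},
]

unlinked = [t["id"] for t in support_tickets if not t["linked_issue"]]
coverage = 100.0 * (len(support_tickets) - len(unlinked)) / len(support_tickets)

print(f"Link coverage: {coverage:.0f}%")
if unlinked:
    print("Tickets missing an issue link:", ", ".join(unlinked))
```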

Kan Yilmaz:
And if you actually do process improvements, this tracking that you do might actually be similar to Change Failure Rate, in the sense that you can use it as a high level metric, because it does capture quite a lot of the whole process: if there is a bug, you know that customers are complaining; if there's downtime, they will complain. Change Failure Rate captures this; number of bugs, if you track it properly, captures it too. It also captures a bit more of the spectrum, going all the way to less important bugs, and how you can make it more actionable. So you can use it in a lot of different ways. Change Failure Rate definitely is a really good metric, I highly encourage everyone to track Change Failure Rate. And if you have enough resources to build this whole system that I just described for how to track bugs, you can track that alongside Change Failure Rate. Change Failure Rate measures the high severity incidents really, really well, whereas number of bugs tracks a wider spectrum of the quality of your application.

Junade Ali:
Awesome. And I think one of the things which is really critical, especially when you're measuring, for example, that median full resolution time I spoke about earlier, that MTTR. One of the things which is particularly powerful when you're getting customer requests is you often find that one bug you're tracking in your project management system may be linked to multiple customer support tickets. There could be an enterprise customer complaining about this, there could be a diverse set of customers, and it could matter differently to each of them. And in order to drive that MTTR, that median full resolution time, down further, you should focus on the bugs which have the widest impact, and what I mean by that is the biggest impact on customers as well as multiple people reporting them. So you've got one bug which affects 10 users, another which only affects one. If you go after the bug that affects 10, you've closed 10 support tickets, right? And because of that your median time is driven down a lot further; you get a bigger impact on that metric if you're able to prioritize by usage.
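
To illustrate that prioritization, here is a sketch that ranks open bugs by how many support tickets link to them, under the assumption that each bug record knows its linked tickets; the identifiers are made up for the example.

```python
# Sketch of prioritizing bugs by how many customer support tickets they close.
# The bug records are assumptions for the example.

bugs = [
    {"id": "BUG-7",  "linked_tickets": ["SUP-1", "SUP-4", "SUP-9"]},
    {"id": "BUG-12", "linked_tickets": ["SUP-2"]},
    {"id": "BUG-19", "linked_tickets": ["SUP-3", "SUP-5", "SUP-6", "SUP-7"]},
]

# Fixing the bug with the most linked tickets resolves the most customer
# inquiries at once, which pulls the median resolution time down fastest.
for bug in sorted(bugs, key=lambda b: len(b["linked_tickets"]), reverse=True):
    print(bug["id"], "closes", len(bug["linked_tickets"]), "support tickets")
```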

Junade Ali:
So, I guess there's another component to this, which is really how you encourage customers to give you that feedback, and also how you make sure that you prioritize it when it comes in. Julian, I was wondering if you could talk a little bit about your experience in that area and how you really build that type of relationship with your customers?

Julian Colina:
Well, for us, we really focus on I suppose you can call it MTTR, from a product or sales standpoint. It's okay to make mistakes at Haystack, but we try to solve them as quickly as possible. And I think, for the most part, a lot of the sales advice is the same thing. If you were to say the wrong thing, or maybe the implementation goes wrong, right up front, the most important thing is to catch it early to fix it quickly. Mistakes can be made, and it's all about how quickly can we actually help resolve that for the customer. So with Haystack, we have like a whole Slack community for every single user, they can talk to me almost too excessively, within that Slack channel. Literally any time of the day, you can hit any one of us in Slack, actually, because we're always in there as well. So this close connection to the customer is super important.

Julian Colina:
Obviously, there's like bug tracking systems that we talked about already, which can help track bugs, who they affect, and all these types of things. I don't want to get too far into the weeds about that, because it's largely on the product side of the business, as opposed to the engineering. So what we really look for is these aggregate metrics as time goes on. So what is that Change Failure Rate? What is that MTTR? And are we improving overall? Now, when we get down to the day-to-day, it's just about, we see an issue, we fix it for the customer. Just to focus on the customer is really how we do it internally at Haystack.

Junade Ali:
For sure. And I guess, bringing that down to this equation between risk and reward, when you're building a product, if we use the Netflix example: if after watching a movie for about five minutes a rating feature doesn't work, and I'm not able to rate it, that's a lot more palatable for me than if my film doesn't play at all. And there are other kinds of products and services which are even more mission critical, where reliability and integrity are so much more important, in various different industries. And within those products themselves there are features you care about and other features you don't. So there's really this balance between risk and reward, not just within the product process, but within the engineering process, and deciding where you actually focus your efforts.

Junade Ali:
So I guess it's a question which is open to the floor, really. But what kind of best practices have you guys seen around that? I guess it's something you can't really measure, and it's something where you really, really need to have that open, frank communication in your engineering org. To be able to do that, you need to have these things built into your specification process and things like that. What are your thoughts on that balancing act, really?

Julian Colina:
I think the balancing act is particularly interesting, and it's something where it's so hard to strike the right balance. I think that's where a lot of teams actually struggle. So to try to tie it home to a real world example where this has been super successful, it actually comes from lean manufacturing. Which, for anybody listening, that's a lot of words. Lean methodology comes initially from Toyota. Back in the day, before Toyota transformed what lean manufacturing looks like, it was all about how many days since the last incident, how reliable can we get this entire production line. And that's how they thought about efficiency and productivity. And so Toyota came through and said, "Well, we're building all these things that are hyper reliable, but they're not flexible at all." So they came up with the idea of an Andon Cord, which is: whenever somebody sees an issue, you pull the Andon Cord, and you immediately go fix it. The production line stops. Which was almost blasphemous at the time, because that means your entire production line stops. That's crazy, we need to keep this funnel pumping out cars or what have you.

Julian Colina:
And Toyota basically said, whenever there's an issue, you pull the Andon Cord, we fix it. And what that ended up leading to was one of the most agile manufacturing companies in the world. And just like we talked about with speed versus quality, Toyota was able to do this because of this Andon Cord. The quicker you can find issues, the quicker you can fix them. And of course, that led to increased profitability, market share, customer satisfaction. I mean, Toyota is who they are today. And ultimately, it led to some of the ideals that we strive for in software as well: lean methodology and trying to balance this aspect of risk versus reward. I think they go hand in hand. When you see an issue, we should try to fix it. And I think that's where Haystack fits, is we can start to evaluate how big of an issue this is. Sometimes you pull that Andon Cord and it's just not worth fixing right now, and that's totally okay. But it's all about this concept. And having that Andon Cord to pull, a lot of teams don't have that.

Kan Yilmaz:
I'd also want to look at this from a different perspective. There are a lot of conversations on, is monolith better? Are microservices better? Do we decentralize or centralize processes and teams, and what's going on? One thing that I can recommend is actually from one of the pioneers in production methods and pipelines. His name's Eliyahu Goldratt. He has an amazing book called The Goal. In his book, in his methodologies, what he explains is, if you give a metric to a team who does not 100% own it, they will completely ignore it, they won't take action. If you say, "Okay, engineering team, our monthly active users went down," the engineering team is like, "So what? I don't know, I don't make the decisions on the product changes. I don't have any kind of responsibility there. Why do you measure me on that?" So you need to measure each team with the metrics that they own, that they have 100% control over.

Kan Yilmaz:
Coming back to best practices: okay, Netflix had a rating team, and they have the whole system that is running Netflix videos, and the CDNs and everything. If each team is responsible for their own product, then you can ensure that they will try to optimize their KPIs. And if they're behind, the whole organization can actually provide them more resources so that they can achieve what those numbers should be aiming for. So in that sense, making sure that teams themselves are responsible for their own systems is a good practice to go forward with.

Kan Yilmaz:
But then we have issues such as, who actually connects all of these teams and their centralized processes? Some of these will be really expensive to build in a decentralized manner. That's a harder question, it depends on the skill that you have. So there is no concrete evidence or answer that I can give right now. But one thing that I can say is, if a team tracks their own metric, if they're solely responsible for their own mistakes, and you make them responsible and talk about these every single week, then they will actually understand what is important and what is not, they will make sure that those metrics are correct, and basically let the whole company move forward in a much better way.

Junade Ali:
This actually reminds me of something from when I was, again, in the road traffic engineering world. Sadly, in that industry, the North Star for reliability or safety often ends up being the number of lives lost per 1,000 miles driven, or whatever it is. And the interesting thing is, I remember one example from the UK, where we are lucky enough to... There are some Nordic countries and Switzerland as well; Norway, Switzerland, and the UK usually tend to be in the top three positions for road safety in the world. But one of the systems in the UK that achieves that is smart motorways. This system basically works by having intelligence built into the motorway network so that you can react. There's a system called MIDAS, Motorway Incident Detection and Automatic Signalling, which basically automatically intervenes to improve the safety of a road if a car breaks down in the fast lane or something like that.

Junade Ali:
And I remember when they rolled out one particular form of smart motorway. In this example, all four of the motorway lanes would be running, so there would be no hard shoulder. There would be an increase in the casualty rate in one particular area, because if a vehicle stopped dead in one particular lane, an accident would be more likely. But in fact the overall system, because of all the different optimizations, would be in a much healthier position. And ultimately there was a publicity outcry in the media because of this one area of safety being degraded, even though the overall system was better as a whole.

Junade Ali:
So I'm wondering, how do you go about doing that? Because if a team is responsible for one particular KPI, you often have to make sure that you don't lose sight of that overall North star. And I think it's not just in very high reliability engineering disciplines. More and more people are dependent on the internet, as we've seen during the past year or so, and the tools we're building have a very, very profound impact on the world. We have to think those things through, not just from a product standpoint, but also because it is fundamental to our businesses being able to grow, to having the reputation and the trustworthiness in the industry. So how do you really go about balancing those team-level metrics against the overall quality of the work that your whole engineering org is producing?

Kan Yilmaz:
Go ahead, Julian.

Julian Colina:
I was just going to say, I think you have to do both. You measure at the organization level to make sure that you're improving, but then each team sort of has that as well. In the same way that every other department sort of boils up their high level KPIs and OKRs to the organizational level. So the entire department falls under a small set of KPIs, and those sort of trickle down to everybody else, so that we can all align on these North star metrics. I think the same thing happens for organizations versus teams as well. So a team has these sort of micro optimizations, these local optimizations that they can make.

Julian Colina:
Typically, that's like the leading metrics that we talked about. What do your code review practices look like? Are you having too much work in progress? So these micro processes that you can change. But from a global perspective, we have these DevOps teams, these platform teams, these developer experience teams that can really focus across many different engineering teams and help keep an eye on what the organization as a whole looks like. So then, along with a director of engineering, head of engineering, VP of engineering, they're generally the ones looking at these metrics to ensure that we're all growing and leveling up as a whole, as an organization. And then we can dive into each particular team, which is likely to have different issues. The mobile app team is never going to start working on the CI/CD pipeline for everybody else, that's mostly going to be on the platform team. So different scopes for the same problem, for the same core North star metrics.

Kan Yilmaz:
Yeah, I was actually going to say basically the same thing. Google has made this more public, which is OKRs, Objectives and Key Results. What happens is you have a company OKR, and you try to hit that. You have each department's OKRs, and you try to hit those, which will affect the OKRs of the company itself. Then each department has different teams, and each team will decide, "Okay, how do my objectives actually hit my department's goal?" Then you set your own objective, then you have your key results, and it basically boils up. So anything that you do as a team will actually affect the organization's objectives all the way up top, where the sales people, the marketing team, and all the other teams are also working on it. And their whole objective is the same objective, but we divide it into subtasks for each department. And we can tackle our own problems, our own goals, and still affect the organization's goal. Which should be on how we...

Kan Yilmaz:
This is actually how I think we should move forward. We shouldn't act completely by ourselves and focus on completely separate parts of the organization without having a clear goal. Goals make us move faster, they keep whole teams aligned. And you can basically be more competitive in the market if you have better aligned goals, compared to someone who doesn't have goals and runs around like a chicken trying to figure out different things, with all departments completely chaotic and individual.

Junade Ali:
Yeah, that makes sense for sure. And I guess, at the software level this individuality is like... We talk about CFR, MTTR. If you're shipping a mobile app, and you can only update once a week, you have to have a slightly different balance between CFR and MTTR because of the recovery time. If you're shipping an IoT product, and it's in hardware, again, a very different relationship. As opposed to if you're doing something which is just an API service, (laughs) or not just an API service, but if it's an API service, you have that greater level of flexibility there.

Julian Colina:
For all you building "just an API service". (laughter)

Junade Ali:
Awesome. So I think we've covered this topic really, really conclusively. So thank you, thank you both for your time. And just to wrap us up, I guess there's one element, which is where people can find out more information about Haystack and get in contact. But also, is there anything else either one of you wanted to add to this discussion before we close it off?

Julian Colina:
No. If anybody has any questions or wants to reach me at any time, it's my first name, Julian@usehaystack.io. I literally check every email, so if you want to reach out, feel free.

Junade Ali:
And Kan, your email is Kan@usehaystack.io. And equally, you're quite active in the Haystack community as well, and you can find all three of us there. Awesome. Well, thank you very much to both of you for taking the time today. And we'll see you later then.

Julian Colina:
Awesome. Thanks for having us.

Kan Yilmaz:
Pleasure to be here. Nice talking to you. Take care.

Junade Ali:
Thank you for joining me for another episode of the Engineering Insights podcast. This podcast has been recorded and produced in Edinburgh, Scotland. I've been joined remotely by Julian Colina from California in the United States, and Kan Yilmaz from Singapore. The soundtrack used for this podcast is Werq, spelt W-E-R-Q, by Kevin MacLeod. Learn more about Haystack Analytics at usehaystack.io.
