We’ve all been there. You’re on an important Zoom call with your team, and someone uses an abbreviation you’re not familiar with. You’ve heard it, but you’re not quite sure exactly what it means. You want to do a quick Google, but you’re sharing your screen! Ugh.
Let’s pull apart some of these abbreviations for incident management KPIs (Key Performance Indicators). Now, you won’t find yourself SOL on your next Zoom call with the Support team. Oh, by the way, they’re technically “initialisms”; “acronyms” have to be pronounceable (e.g., NASA). If you can pronounce any of the initialisms in the title, don’t.
Let’s jump in, FTW!
MTTF stands for mean time to failure. This is the average lifespan of a given device. Mean time to failure is calculated by adding up the lifespans of all the devices, and dividing that total by the number of devices.
MTTF = total lifespan across devices / # of devices
MTTF is specific to non-repairable devices, like a spinning disk drive; the manufacturer would talk about its lifespan in terms of MTTF.
For example, consider three dead drives pulled out of a storage array. S.M.A.R.T. indicates that they lasted for 2.1, 2.7, and 2.3 years respectively:
(2.1 + 2.7 + 2.3) / 3 = ~2.37 years MTTF
We should probably buy some different drives in the future.
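If you’d rather let a script do the arithmetic, here is a minimal Python sketch of the drive example. The variable names are made up for illustration; the only inputs are the three lifespans quoted above.

```python
from statistics import mean

# Lifespans (in years) of the three dead drives from the example above.
lifespans_years = [2.1, 2.7, 2.3]

# MTTF = total lifespan across devices / # of devices,
# which is simply the arithmetic mean of the lifespans.
mttf_years = mean(lifespans_years)

print(f"MTTF: {mttf_years:.2f} years")
```

With only three drives this is a very small sample, so treat the result as a rough indicator rather than a vendor-grade figure.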
MTTF alternatively stands for mean time to fix, but it seems that “failure” is the more common meaning.
MTBF stands for mean time between failures. MTBF is used to identify the average time between failures of something that can be repaired.
Mean time between failures is calculated by adding up all the lifespans of devices, and dividing by the number of failures:
MTBF = total lifespan across devices / # of failures
The total lifespan does not include the time it takes to repair the device after a failure.
An example of MTBF would be how long, on average, an operating system stays up between random crashes.
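As a rough sketch, the operating system example could be computed like this in Python. The uptime figures are invented for illustration, and the function name is hypothetical; the point is only that repair time stays out of the numerator.

```python
# Hypothetical uptime log: hours the OS ran before each crash.
# Time spent rebooting or repairing is deliberately NOT included.
uptime_hours_before_crash = [310.0, 450.0, 140.0]


def mean_time_between_failures(uptimes):
    """Total operating time divided by the number of failures."""
    if not uptimes:
        raise ValueError("need at least one failure to compute MTBF")
    return sum(uptimes) / len(uptimes)


mtbf_hours = mean_time_between_failures(uptime_hours_before_crash)
print(f"MTBF: {mtbf_hours:.1f} hours")
```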
MTTR stands for mean time to repair, mean time to recovery, mean time to resolution, mean time to resolve, mean time to restore, or mean time to respond. Mean time to repair and mean time to recovery seem to be the most common.
Mean time to repair (and restore) is the average time it takes to repair a system once the failure is discovered. It is calculated by adding the total time spent repairing and dividing that by the number of repairs.
MTTR (repair) = total time spent repairing / # of repairs
For example, let’s say we pulled three drives out of an array. Two of them took 5 minutes each to walk over and swap out. The third took 6 minutes because the drive sled was a bit jammed. So:
(5 + 5 + 6) / 3 = 5.3 minutes MTTR
Mean time to repair assumes the system that has failed is capable of restoration, and does not require replacement. It is synonymous with mean time to fix.
Mean time to recovery, resolution, and resolve all refer to the time it takes from when something goes down to the time that it is back at full functionality. This includes everything from finding the problem to fixing it. In DevOps and ITOps, keeping MTTR to an absolute minimum is crucial.
MTTR (recovery) = total time spent discovering & repairing / # of repairs
Mean time to respond is the most basic of the bunch. Mean time to respond is the average time it takes to respond to a failure.
MTRS stands for mean time to restore service. MTRS is the average time it takes from when a failure is detected to the time that the service is back at full functionality. MTRS is synonymous with mean time to recovery, and is used as a way to differentiate mean time to recovery from mean time to repair. MTRS is the preferred term for mean time to recovery, as it’s more accurate and less confusing, per ITIL v4.
MTRS = total downtime / # of failures
MTBSI stands for mean time between service incidents and is used to measure reliability. MTBSI is calculated by adding MTBF and MTRS together.
MTBSI = MTBF + MTRS
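To see how the pieces fit together, here is a small Python sketch. The up/down cycles are hypothetical numbers chosen to keep the math easy; nothing here comes from a real system.

```python
# Hypothetical history for one service.
# Each tuple is (hours up before failing, hours down until restored).
cycles = [(200.0, 2.0), (340.0, 1.0), (360.0, 3.0)]

failures = len(cycles)
total_uptime = sum(up for up, _ in cycles)
total_downtime = sum(down for _, down in cycles)

mtbf = total_uptime / failures    # mean time between failures
mtrs = total_downtime / failures  # mean time to restore service
mtbsi = mtbf + mtrs               # mean time between service incidents

print(f"MTBF={mtbf} h, MTRS={mtrs} h, MTBSI={mtbsi} h")
```

Because both averages share the same failure count, MTBSI here also equals the total elapsed time (uptime plus downtime) divided by the number of failures.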
MTTD stands for mean time to detect. This is the average time it takes you, or more likely a system, to realize that something has failed. MTTD can be calculated by adding up all the times between failure and detection, and dividing them by the number of failures.
MTTD = total time between failure & detection / # of failures
MTTD can be reduced with a monitoring platform capable of checking everything in an environment. With a monitoring platform like LogicMonitor, MTTD can be reduced down to a minute or less by automatically checking everything in your environment for you.
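If you have timestamps for when each failure started and when it was detected, the MTTD formula above can be sketched in Python like this. The timestamps are invented for illustration, and this is not how any particular monitoring platform computes the metric.

```python
from datetime import datetime, timedelta

# Hypothetical (failed_at, detected_at) pairs for three incidents.
incidents = [
    (datetime(2024, 1, 5, 9, 0, 0), datetime(2024, 1, 5, 9, 0, 45)),
    (datetime(2024, 1, 9, 14, 30, 0), datetime(2024, 1, 9, 14, 31, 15)),
    (datetime(2024, 2, 2, 3, 10, 0), datetime(2024, 2, 2, 3, 11, 0)),
]

# MTTD = total time between failure & detection / # of failures.
# sum() needs a timedelta start value because the items are timedeltas.
total_delay = sum(
    (detected - failed for failed, detected in incidents),
    timedelta(),
)
mttd = total_delay / len(incidents)

print(f"MTTD: {mttd.total_seconds():.0f} seconds")
```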
MTTI stands for mean time to identify. Mean time to identify is the average time it takes for you or a system to identify an issue.
MTTK stands for mean time to know. MTTK is the time between when an issue is detected, and when the cause of that issue is discovered. In other words, MTTK is the time it takes to figure out why an issue happened.
MDT stands for mean down time. MDT is simply the average time period that a system or device is not working. MDT includes scheduled down time and unscheduled down time. In some sense, this is the ultimate KPI. The goal is 0. Improving your mean time to recovery will ultimately improve your MDT.
MTTA stands for mean time to acknowledge. Mean time to acknowledge is the average time from when a failure is detected to when work begins on the issue.
MTTA = total time to acknowledge detected failures / # of failures
Imagine the 100m dash. The starting horn sounds, you detect it a few milliseconds later. A few more milliseconds after that, your brain has acknowledged the horn by making your legs start running. Measure that 100 times, divide by 100, voila, MTTA.
This KPI is particularly important for on-call DevOps engineers, and anyone in a support role. DevOps engineers need to keep MTTA low to keep MTTR low, and to avoid needless escalations. Support staff needs to keep MTTA low to keep customers happy. Even if you’re still working towards resolution, customers want to know their issues are being acknowledged and worked on promptly.
MTTV stands for mean time to verify. Verification is typically the last step in restoring service. Mean time to verify is the average time from when a fix is implemented to when that fix is verified to be working and to have solved the issue.
MTTV = total time to verify resolution / # of resolved failures
You can improve this KPI in your organization by automating verification through unit tests at the code level, or with your monitoring platform at the infrastructure, application, or service level.
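The MTTA and MTTV formulas above both measure the gap between two points in an incident’s life, so one helper can cover both. Here is a minimal Python sketch; the `Incident` fields and the `mean_delta` helper are hypothetical names, and the timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected_at: datetime      # monitoring noticed the failure
    acknowledged_at: datetime  # a human started working on it
    fixed_at: datetime         # a fix was implemented
    verified_at: datetime      # the fix was confirmed to work


def mean_delta(deltas):
    """Average a sequence of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


start = datetime(2024, 3, 1, 12, 0)
incidents = [
    Incident(start, start + timedelta(minutes=2),
             start + timedelta(minutes=20), start + timedelta(minutes=25)),
    Incident(start, start + timedelta(minutes=4),
             start + timedelta(minutes=30), start + timedelta(minutes=33)),
]

mtta = mean_delta(i.acknowledged_at - i.detected_at for i in incidents)
mttv = mean_delta(i.verified_at - i.fixed_at for i in incidents)

print(f"MTTA: {mtta}, MTTV: {mttv}")
```

Keeping every timestamp on the incident record means other KPIs can be derived later without collecting new data.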
The main difference between MTTF and MTBF is how each is resolved, depending on what failure happened. In MTTF, what is broken is replaced, and in MTBF what is broken is repaired.
MTTF and MTBF even follow naturally from the wording. “To failure” implies it ends there. “Between failures” implies there can be more than one.
In many practical situations you can use MTTF and MTBF interchangeably. Lots of other people do.
The remedy for hardware failures is generally replacement. Even if you’re repairing a problematic switch, you’re likely replacing a failed part of it. Something like an operating system crash still requires something that could be thought of as a “repair” as opposed to a “replacement”.
MTTF and MTBF are largely the concern of vendors and manufacturers. You can’t change the MTTF on a drive, but you can run them in a RAID, and you can drive down MTTR for issues within your infrastructure.
You generally can’t directly change MTTF or MTBF of your hardware, but you can use quality components, best practices, and redundancy to reduce the impacts of failures and increase the MTBF of the overall service.
Mean time to detect and mean time to identify are mostly interchangeable terms depending on your company and the context.
Detecting and acknowledging incidents and failures are similar, but they often differ in the human element. MTTD is most often a computed metric that your monitoring platform should report to you.
For instance, in the case of LogicMonitor, MTTD would be the average time from when a failure happened, to the time that the LogicMonitor platform identified the failure.
MTTA takes this and adds a human layer, taking MTTD and having a human acknowledge that something has failed.
MTTA is important because while the algorithms that detect anomalies and issues are incredibly accurate, they are still the result of a machine-learned algorithm, and a human should make sure that the detected issue is indeed an issue.
Mean time to failure measures how long a system runs before it fails. Mean time to repair measures how long it takes to get a system back up and running. Comparing the two directly is unfair, as what they measure is very different.
Let’s take cars as an example. Let’s say your 2006 Honda CR-V gets into an accident. MTTF could be calculated as the time from when the accident occurs to the time you get a new car. MTTR would be the time from when the accident occurs to the time the car is repaired.
MTBF and MTTR are related as different steps in a larger process. MTBF measures the time between failures for devices that need to be repaired, while MTTR is simply the time that it takes to repair those failed devices. In other words, MTBF measures the reliability of a device, whereas MTTR measures the efficiency of its repairs.
Mean time to fix and mean time to repair can be used interchangeably. The preferred term in most environments is mean time to repair.
Mean time to restore service is similar to mean time to repair, but instead of covering only the time from when the repairs start to when they finish, it covers the time from when the failure is detected to when full functionality is restored.
In general, MTTR as a KPI is only so useful. It will tell you about your repair process and how efficient it is, but it won’t tell you about how much your users might be suffering. If it takes 3 months to find the broken drives, and they are slowing down the system for your users, 5.3 minutes MTTR is not useful or impressive.
Typically, customers care about the total time devices are down a lot more than the repair time. They want to be down as little as possible. For the sake of completeness, let’s calculate this one too:
((5 + 5 + 6) + (3 + 3 + 3)) / 3 = 8.3 minutes MTTR
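Both MTTR figures from the drive example can be reproduced with a few lines of Python. This sketch assumes the three 3s in the calculation above are per-drive discovery times in minutes, which the example does not state explicitly; the variable names are made up.

```python
# Minutes spent swapping each of the three drives (from the example).
repair_minutes = [5, 5, 6]

# ASSUMPTION: the "3 + 3 + 3" term is minutes spent discovering each failure.
discovery_minutes = [3, 3, 3]

repairs = len(repair_minutes)

# MTTR (repair) = total time spent repairing / # of repairs
mttr_repair = sum(repair_minutes) / repairs

# MTTR (recovery) = total time spent discovering & repairing / # of repairs
mttr_recovery = (sum(discovery_minutes) + sum(repair_minutes)) / repairs

print(f"MTTR (repair):   {mttr_repair:.1f} minutes")
print(f"MTTR (recovery): {mttr_recovery:.1f} minutes")
```

Putting the two side by side makes it obvious which "MTTR" someone is quoting, which is half the battle with these initialisms.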
In general, the MTTR KPIs are going to be more useful to you as an IT operator.
When an incident occurs, time is of the essence. With these KPIs, you can get better insight into your remediation processes, and find areas to optimize.
Unfortunately, because of the subtle similarities of each KPI, many of their meanings differ from company to company. If these initialisms come up in a meeting, I suggest clarifying the meaning with the speaker. Otherwise, you might be DOA.
Michael Rodrigues is an employee at LogicMonitor.