What is MTBF?

Nothing in the world of reliability engineering is more ubiquitous, or more vexing, than the concept of MTBF. You will find it everywhere. Sometimes the numbers are very large. Sometimes small. Over 25+ years, I’ve seen numerous interpretations of it, and unfortunately many of them are wrong or, at the least, misleading. Let’s dive into what it is, what it means, and what it really tells you.

MTBF, like so many other things in engineering, is an acronym, in this case for “mean time between failures.” It’s a population and system metric, really intended for something complex, that is, something that can be repaired. In simplest terms, MTBF is calculated by taking the total time in service for a system (or group of systems) and dividing by the number of repair events. Seems simple. If, for example, you had a ship that had been operating for 21,000 hours out on the water, sailing away like Christopher Cross, and it had 3 repair events (unplanned, not preventative maintenance), then we could say the ship’s MTBF was 7,000 hrs. I think most people understand this. And all the numbers feel right. They line up well. It sounds like our ship might have been in port for a repair maybe once a year. Something like that. So all is well?
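To make the arithmetic concrete, here is a minimal sketch of the ship calculation in Python. The function name mtbf_hours is my own label for illustration, not something from a standard or library.

```python
def mtbf_hours(total_service_hours: float, unplanned_repairs: int) -> float:
    """Point estimate of MTBF: total time in service / number of repair events.

    Only unplanned repairs count; preventative maintenance is excluded.
    """
    if unplanned_repairs <= 0:
        # With zero failures the ratio is undefined; a point estimate needs
        # at least one event.
        raise ValueError("need at least one repair event to estimate MTBF")
    return total_service_hours / unplanned_repairs


# Ship example from the text: 21,000 operating hours, 3 unplanned repairs.
ship_mtbf = mtbf_hours(21_000, 3)
print(ship_mtbf)  # 7000.0
```

The same function works for a group of systems: add up the service hours and the repair events across the group before dividing.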

Not really. Almost all examples of MTBF look like this ship example. And it’s great in concept. A ship is, after all, something we assume will be repaired. Short of sinking, pretty much anything on it can be repaired more cost effectively than replacing the entire ship. You never get into a scenario where the numbers stop making intuitive sense. So let’s look at something different. IEEE 493-2007 gives a table in chapter 10 of expected MTBF values for planning purposes in electrical distribution systems. This table is based on surveys of the actual performance of equipment in the power distribution industry (think of your electrical utility, Con Ed or similar). Let’s consider the example of the humble circuit breaker from IEEE 493; specifically a 3-phase, 600V, <600A vacuum breaker. IEEE 493 gives the MTBF for a breaker of that type as 3,114,792 hrs. At this point, anyone first learning about MTBF will say something like “so it will never fail,” which illustrates the first common trap in understanding MTBF.

What should be obvious is that the breaker is not actually going to last 355 years. So what does 3M+ hours mean? Well, it’s really just the reciprocal of failure probability, or rather, failure rate (per hour in this case) for the population of breakers of that type during the useful period of their lifetime. The number in hours is determined by taking the total in-service time accumulated across a population of product and dividing by the number of failure events across that entire population. Mathematically, it is the same as the ship we talked about earlier. Stated differently, if we had 3,114,792 breakers of this type installed out in the world, we’d expect on average 1 breaker somewhere to fail every hour of every day during our period of interest. This leads to a key understanding: MTBF is relative to a timeframe of interest.
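A short sketch may help here. It converts the breaker MTBF into a failure rate and into expected failures across a population. The last function goes one step further than the text and assumes a constant failure rate (an exponential model) inside the survey window; that assumption and all the function names are mine, used only for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 h, ignoring leap years

breaker_mtbf_hours = 3_114_792  # value quoted in the text

# The "355 years" figure is just a unit conversion, not a lifetime.
mtbf_years = breaker_mtbf_hours / HOURS_PER_YEAR

# Failure rate is the reciprocal of MTBF (failures per unit-hour).
failure_rate_per_hour = 1 / breaker_mtbf_hours


def expected_failures(population: int, hours: float, mtbf_hours: float) -> float:
    """Expected number of failures in a population over a span of hours."""
    return population * hours / mtbf_hours


def prob_fail_within(hours: float, mtbf_hours: float) -> float:
    """Chance one unit fails within `hours`.

    ASSUMPTION: constant failure rate (exponential model), valid only
    inside the period the MTBF data actually covers.
    """
    return 1 - math.exp(-hours / mtbf_hours)


print(f"MTBF in years: {mtbf_years:.1f}")
print(expected_failures(3_114_792, 1, breaker_mtbf_hours))  # 1.0 per hour
print(f"1-year failure probability: {prob_fail_within(HOURS_PER_YEAR, breaker_mtbf_hours):.4%}")
```

Under that constant-rate assumption, a single breaker has well under a 1% chance of failing in any given year of the surveyed period, which is a much more useful way to read the number than "355 years."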

MTBF does not speak to lifetime, especially when cataloged in the manner above. MTBF is just a statistic calculated for a particular window of time. This dramatically complicates understanding. Context is key! In the breaker example, the context provided in the IEEE document is 355.6 combined unit-years. IEEE also reported that the minimum in-service time for components was 5 years of data. This is a key point. We can reasonably infer that things will be about the same in the near term, but if we then take a 30 year old system, we do not have data to show that its MTBF will match, and we cannot predict that far away from our survey data period. This is why I tend to think about its reciprocal (failure probability per unit time). It helps me understand it better, because we can recognize that the failure probability of a system or product is NOT a fixed value, and therefore MTBF cannot be fixed either. Both change with the passage of time. And further, the context of our MTBF calculation matters. A lot. You can calculate an MTBF after a mission time of 1 hour each for 1,000,000 somethings with only 1 failure and claim its MTBF is 1,000,000 hours, but that’s really only accurate for that first hour.
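One way to keep the context attached to the number is to store them together. The sketch below is only an illustration of that idea; the class MtbfEstimate and its fields are hypothetical names I made up, not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MtbfEstimate:
    """An MTBF figure kept together with the context it was measured in."""

    total_unit_hours: float    # summed in-service time across the population
    failures: int              # failure events observed in that time
    max_unit_age_hours: float  # oldest unit age the data actually covers

    @property
    def mtbf_hours(self) -> float:
        return self.total_unit_hours / self.failures

    def covers(self, age_hours: float) -> bool:
        """True if the underlying data includes units at least this old."""
        return age_hours <= self.max_unit_age_hours


# The million-unit example: 1,000,000 units run for 1 hour each, 1 failure.
short_test = MtbfEstimate(
    total_unit_hours=1_000_000 * 1.0,
    failures=1,
    max_unit_age_hours=1.0,
)

print(short_test.mtbf_hours)      # 1000000.0
print(short_test.covers(1.0))     # True: inside the observed window
print(short_test.covers(8760.0))  # False: nothing was observed at 1 year
```

The MTBF value alone looks impressive. The covers check is what tells you the estimate says nothing about a unit that has been running for a year.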

If we go back to the ship example, a brand new ship when first launched probably has some issues that need to be ironed out. Often, this is called a “shakedown” period. If we counted the duration of the shakedown (say something like 1,000 hrs) and we had, say, 3 issues to resolve, then the MTBF during shakedown was about 333 hrs. After those initial issues get sorted, things approach a sort of steady state, maybe like the earlier example of 21,000 hrs of sailing with only 3 more repairs. For that period, the MTBF of the ship has increased dramatically to 7,000 hours. And what about late in life, after 40 or 50 years? We would probably expect the probability of events to increase as the ship gets older. So what you should be able to see clearly is that MTBF changes during the product lifecycle, as does the failure probability.
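The two periods can be put side by side in a few lines. This is just a sketch using the numbers from the ship example; the "overall" figure at the end is my own addition, to show what happens if the two periods get lumped together.

```python
# (phase name, operating hours, unplanned repairs), numbers from the text
phases = [
    ("shakedown", 1_000, 3),
    ("steady state", 21_000, 3),
]

# MTBF computed separately for each phase of life
phase_mtbf = {name: hours / repairs for name, hours, repairs in phases}

# MTBF if both phases are lumped into a single number
total_hours = sum(hours for _, hours, _ in phases)
total_repairs = sum(repairs for _, _, repairs in phases)
overall_mtbf = total_hours / total_repairs

for name, value in phase_mtbf.items():
    print(f"{name}: {value:.0f} hrs")
print(f"lumped together: {overall_mtbf:.0f} hrs")
```

The lumped number lands between the two phase values and describes neither of them well, which is one reason to state the period an MTBF was calculated over.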

Another key mistake I’ve seen made is to look only at the warranty claims for a product and calculate MTBF without considering all the product that doesn’t have a claim. Let’s look at a new example for a “widget.” We have 100 widgets that shipped on February 1st, 2024. Twelve months later, we’ve had 10 failures with a time in service shown in column E, varying from 150 days to 304 days. Some people will take the average of those times and say it’s the MTBF (222 days). That’s incorrect, though understandable: it is the average time to the reported failures, which certainly sounds like an MTBF. But that average neglects the other 90 widgets in the population. Again, recall that MTBF is the reciprocal of failure probability. That may help you realize that the probability of failure at 1 year can’t be >10%, since only 10% have failed. The correct way in this example is to approximate the in-service time for the other 90 widgets and add that to the known in-service time for the failures, then divide the total by the number of failures. This is something we do all the time to approximate a real MTBF at a timeframe, or the probability of failure for a time period.
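Here is a sketch of the wrong and the right calculation side by side. The individual failure times are NOT the real data behind the example; they are hypothetical values I picked so that they span 150 to 304 days and average 222 days. I also assume the 90 surviving widgets went into service on the ship date and ran for the whole period.

```python
from datetime import date

ship_date = date(2024, 2, 1)
review_date = date(2025, 2, 1)  # 12 months later
observation_days = (review_date - ship_date).days  # 366, since 2024 is a leap year

population = 100

# HYPOTHETICAL days-in-service at failure, chosen to match the stated
# range (150 to 304 days) and average (222 days).
failure_days = [150, 165, 180, 200, 215, 230, 245, 260, 271, 304]

# Wrong: average only the units that failed.
naive_average_days = sum(failure_days) / len(failure_days)

# Better: include the time accumulated by the units that have NOT failed.
# ASSUMPTION: survivors were in service for the full observation period.
survivors = population - len(failure_days)
total_unit_days = sum(failure_days) + survivors * observation_days
estimated_mean_days = total_unit_days / len(failure_days)

print(naive_average_days)   # 222.0
print(total_unit_days)      # 35160
print(estimated_mean_days)  # 3516.0
```

With these assumed inputs, counting the surviving units moves the estimate from 222 days to roughly 3,500 days, more than an order of magnitude apart. As with any MTBF, that figure only describes the first year of service.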

Lastly, a word about MTTF. Yikes, it’s another acronym! The astute among the readers will note that what is actually calculated in the widget example is MTTF (mean time to failure), and not MTBF. The difference is really semantic: we are talking about a repairable system when we talk about MTBF, and a non-repairable system when we talk about MTTF. Or rather, a single component generally has an MTTF, because once it has failed, it is not repaired but replaced. So components within systems often have MTTF values, while the system has an MTBF. Complicating matters, the two terms often get used interchangeably. Why? Because mathematically, they are calculated the same way. They both essentially describe the probability of an event, and the semantic difference is whether it is time between repair events or time to replacement events.

Takeaways:

MTBF is not a static metric; it changes with the passage of time.
MTBF does not generally speak to the lifetime of a product/system.
MTBF is just 1 / (failure probability per unit time), that is, the reciprocal of the failure rate.
MTBF is truly a population statistic.
Context of MTBF is important, and interpretation outside the contextual bounds can lead to bad conclusions.
If the item is repairable, talk about MTBF; if it’s just replaced, talk about MTTF. But calculate it the same way.
