Thursday, October 8, 2015

Reliability assessment and mitigation by failure propagation analysis



New requirements (ISO 26262, IEC 61508) for electronic system reliability focus on functional safety analysis, which requires a “vertical approach” to failure analysis.

This vertical approach corresponds to following the propagation of a failure from its inception all the way to its impact on the end user, through all the layers: component, subsystem and system.
The vast majority of initial failures will never make it past the component or subsystem level. They are derated by design, or naturally filtered out: their propagation stops along the way.

But some critical errors will propagate and create system failures in a domino effect. Following the path of these failures is the challenge we should be trying to solve. An even bigger challenge ahead of us is making sure that we are exhaustive in this approach: we can’t miss any of the failures that create system problems.

Why this new approach?


The issue until now is that the communication of reliability performance follows the segmentation of the value chain. Different types of failures, if not clearly identified, fall into one wide bucket labeled “errors”, regardless of the effect these errors have at the system level when they propagate.

Furthermore, traditional statistics are used to report failure rates representing the behavior of the whole population in this bucket. Results are usually presented as a distribution, with a mean and a standard deviation covering 90% or 95% of possible cases. Unfortunately, failure analysis is about the treatment of outliers. The failures that are not derated or blocked by the system represent only a few percent, or tens of percent, of the total number of events, and these are the ones we are interested in. These outliers sit in the tail of the distribution, and we should design mitigation for them. So presenting reliability performance as a mean and a standard deviation is not the best tool for analyzing our problem, as illustrated in the following picture (from a previous blog). It is not about the distribution of all failure modes of a component with respect to a unique spec (“how far are most of my numbers from the cliff?”); it is about which of these events will create the domino effect at system level, or fall off the cliff.
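As a quick numeric illustration of this point (all numbers below are hypothetical, chosen only to make the argument visible), here is a minimal Python sketch comparing the mean/standard-deviation view with the tail that actually drives the risk:

```python
import random
import statistics

random.seed(0)

# Hypothetical "system impact" scores: 970 benign events clustered near 1.0,
# plus 30 escaping events sitting far out in the tail.
impacts = [random.gauss(1.0, 0.3) for _ in range(970)] + \
          [random.gauss(10.0, 2.0) for _ in range(30)]

mean = statistics.mean(impacts)
stdev = statistics.stdev(impacts)
p99 = sorted(impacts)[int(0.99 * len(impacts))]

print(f"mean = {mean:.2f}, stdev = {stdev:.2f}")  # looks comfortable
print(f"99th percentile impact = {p99:.2f}")      # dominated by the tail events
```

The mean and standard deviation describe the benign majority; the few tail events, the ones that would “fall off the cliff”, barely move those two numbers at all.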



The approach of following failure propagation recognizes that failures can show many different symptoms, and it is not accurate to lump all of these symptoms into the same bucket. When they propagate to a higher level, the subsystem or the system, these failures can either be benign or create system failures. Obviously the circuit designer does not have the visibility and information to judge the criticality of a system failure, nor can he analyze the propagation of errors beyond the chip, into the system. Circuit designers therefore simply don’t know how to classify failures, because the system-level consequences are seldom clearly explained by the system engineers.

What are we missing here?

We’re missing a methodology to track the failure along each and every layer of the value chain. Let’s talk about this later….

Another drawback of horizontal market segmentation stems from the fact that where there are unknowns, there is risk, and against perceived risk engineers increase design margins. You can easily see how margins can be inadequately assigned and pile up at each layer of the value chain. Here’s another example of how margin assignments get distorted:
When a device maker asks their chip vendor for the failure rate of their chip, which number does the vendor provide? Well, most of the time, it’s their best performance! Of course, it is human nature to want to impress your customers, even if the result produced doesn’t quite correspond to the actual hardware configuration or application.
Reliability testing can be expensive, so if results exist, they are for one version or configuration of the device (hardware + firmware). Needless to say, these results get reused for as many versions of the chip as possible. At this point, either the system designer uses the data as communicated and creates a dangerous situation for his application by underestimating the risk (in reliability, you want to know the worst case, not the best performance), or, on the contrary, he errs towards conservatism and adds margins on top of the result to play it safe. Imperfect testing will also report only one number for the whole bucket (including every type of error), maximizing margins for a number of failures that shouldn’t be of concern, when only one type of critical failure should be addressed separately.
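To make the bucket problem concrete, here is a minimal sketch with hypothetical FIT values showing how a single reported number distorts the margining target compared to a per-symptom breakdown:

```python
# FIT = failures per 10^9 device-hours. All numbers below are hypothetical.
reported_total_fit = 500.0  # the single number a vendor might report

# Hypothetical breakdown by symptom, which the single number hides:
breakdown_fit = {
    "benign": 450.0,       # derated or masked: never leaves the chip
    "recoverable": 45.0,   # handled one level up (retry, reset, redundancy)
    "critical": 5.0,       # the only class that threatens the application
}

critical_fit = breakdown_fit["critical"]
print(f"margining target with one bucket: {reported_total_fit:.0f} FIT")
print(f"margining target when split:      {critical_fit:.0f} FIT "
      f"({reported_total_fit / critical_fit:.0f}x smaller)")
```

With a hypothetical split like this one, margins sized against the whole bucket are two orders of magnitude larger than what the critical class actually requires.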

In an ideal world, specifications and requirements would be determined according to the impact of an error propagating to the system level, or to the end customer’s level. This approach already exists in some cases, where major system houses have specified requirements per component according to their system impact [1] (Cisco’s paper).

Reliability analysis can be customized not only for each product and application, but for each category of fault. To avoid over-engineering due to margin pile-up, while still catching the outlier cases that will wreck the application, reliability analysis should be implemented according to the seriousness of symptoms.

We present here a model for describing the reliability situation that would help solve the overdesign problem for high-reliability applications.

We define three classes of reliability targets, namely LF (Low Failure), ULF (Ultra Low Failure) and XLF (Extremely Low Failure). As proposed as an example in the table below, these three categories apply to the failure rate of the whole device and map to the type of application and specification. These categories can also be linked to existing specifications or reliability targets.



Pushing the model further, we can define these three categories per error type, according to its consequences. The same device will then have different FIT requirements depending on the type of failure, and especially on its symptoms at the system level.

Let’s take as an example a car as the system, the engine controller as the subsystem, and an MCU inside that controller as the component (or device).

Failures like radiation-induced Single Event Upsets (SEUs) occur at different locations inside the chip. Some of these errors will propagate to the outputs of the chip; most won’t.

We can define a criticality level for each type of error, the same way we did previously for the whole chip (hence the color codes on the different paths: green, yellow, red). For example, SDC (Silent Data Corruption) may have a much tighter FIT requirement than DUE (Detected Uncorrectable Error), because the latter can be corrected at a higher level in the system.
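As a sketch of what this could look like in practice (the class assignments and FIT budgets below are hypothetical illustrations, not values from any standard), the same MCU would carry one budget per error type:

```python
# Hypothetical mapping of error types to reliability classes and FIT budgets.
# Tighter budgets go to symptoms the system cannot detect or remedy itself.
ERROR_TYPE_TARGETS = {
    "SDC": {"class": "XLF", "max_fit": 1},    # silent data corruption
    "DUE": {"class": "ULF", "max_fit": 50},   # detected, handled higher up
    "CE":  {"class": "LF",  "max_fit": 500},  # corrected error, benign
}

def meets_budget(error_type: str, measured_fit: float) -> bool:
    """True if the measured rate fits the class-specific budget."""
    return measured_fit <= ERROR_TYPE_TARGETS[error_type]["max_fit"]

print(meets_budget("SDC", 0.4))   # True: within the (hypothetical) XLF budget
print(meets_budget("DUE", 120))   # False: exceeds the (hypothetical) ULF budget
```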





Now, let’s place this MCU within the whole system. Some of the errors that made it through to the outputs of the chip will propagate (or not) within the engine control unit (ECU) all the way to the system level (the car). The criticality of each failure can change from red to yellow or green according to the tolerance for that failure at system level and the mitigation mechanisms in place as a remedy. What we can see here is that, among all failures detected at the chip level with different impacts:
- the seriousness of the impact can change as the failure propagates within the system;
- only a portion of them propagates to become a system failure.
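A small sketch of this propagation view (layer names and downgrade rules are illustrative assumptions, not a real analysis tool): the criticality assigned at the chip level is re-evaluated at each layer the failure crosses, and downgraded when it is filtered out or mitigated along the way.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    propagates: bool   # does the failure cross this layer's boundary?
    mitigated: bool    # is there a remedy here (ECC, watchdog, redundancy)?

def system_criticality(chip_level: str, path: list) -> str:
    """Walk the propagation path from chip to system and return the
    criticality seen at the system level."""
    criticality = chip_level
    for layer in path:
        if not layer.propagates:
            return "green"          # filtered out: no system-level effect
        if layer.mitigated:
            criticality = "yellow"  # still propagates, but a remedy exists
    return criticality

# Example: an error escaping the MCU is caught by a watchdog in the ECU,
# so a chip-level "red" becomes "yellow" at the car level.
path = [Layer("MCU output", True, False),
        Layer("ECU", True, True),
        Layer("Car", True, False)]
print(system_criticality("red", path))   # -> yellow
```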



How do we solve the problem?

How much can be gained with this approach compared to more traditional approaches?
That’s the topic of my next blog.
Until then, what do you think about this approach? Let me know…..


Ref [1]: 

A. Silburt, A. Evans, et al. (Cisco Systems), “Specification and Verification of Soft Error Performance in Reliable Internet Core Routers.”