Reliability assessment and mitigation by failure propagation analysis
New requirements for electronic system reliability (ISO 26262, IEC 61508) focus on functional safety analysis, which requires a "vertical approach" to failure analysis. This vertical approach means following the propagation of a failure from its inception all the way to its impact on the end user, through every layer of component, subsystem and system.
The vast majority of initial failures never make it past the component or subsystem level. They are derated by design or naturally filtered out: their propagation stops along the way. But some critical errors will propagate and create system failures like a domino effect. Following the path of these failures is the challenge we should be trying to solve. An even bigger challenge ahead of us is to make sure we are exhaustive in this approach: we cannot miss any of the failures that create system problems.
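To make the vertical approach concrete, here is a minimal sketch (in Python, with entirely hypothetical layer names and filtering probabilities) of what it means to follow each raw failure through the layers and count only the ones that survive to the system level:

```python
import random

# Hypothetical filtering probabilities per layer: the chance that a raw
# failure is derated, masked, or corrected before it reaches the next layer.
LAYERS = [
    ("component", 0.90),   # e.g. derating, on-chip ECC
    ("subsystem", 0.80),   # e.g. watchdogs, retries
    ("system",    0.50),   # e.g. plausibility checks, fail-safe modes
]

def propagate(n_events: int, seed: int = 0) -> int:
    """Count how many of n_events raw failures survive every layer
    and become visible system-level failures (the domino cases)."""
    rng = random.Random(seed)
    survivors = 0
    for _ in range(n_events):
        blocked = False
        for _layer, p_filtered in LAYERS:
            if rng.random() < p_filtered:
                blocked = True   # propagation stops at this layer
                break
        if not blocked:
            survivors += 1
    return survivors

if __name__ == "__main__":
    n = 100_000
    print(f"{propagate(n)} of {n} raw failures reach the system level")
```

With these made-up numbers, only about 1% of the raw events survive all three filters; those survivors are exactly the domino cases the analysis has to enumerate exhaustively.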
Why this new approach?
The issue until now is that the communication of reliability performance follows the segmentation of the value chain. Different types of failures, if not clearly identified, fall into one wide bucket labeled "errors", regardless of the effect these errors have at the system level when they propagate.
Furthermore, traditional statistics are used to report failure rates representing the behavior of the whole population in this bucket. Results are usually presented in the form of a distribution, with a mean and a standard deviation covering 90% or 95% of possible cases. Unfortunately, in failure analysis we are dealing with the treatment of outliers. The failures that are not derated or blocked by the system represent only a few percent, or tens of percent, of the total number of events, and these are the ones we are interested in. These outliers sit in the tail of the distribution, and we should design mitigation for them. So presenting reliability performance as a mean and a standard deviation is not the best tool for analyzing our problem, as illustrated in the picture below (from a previous blog). It is not about the distribution of all failure modes of a component with respect to a single spec ("how far are most of my numbers from the cliff?"); it is about which of these events will create the domino effect at system level, or fall off the cliff.
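Here is a small sketch of that point, with invented numbers: the mean-and-sigma summary looks comfortable, yet a direct count of the tail shows the outliers that actually cross the cliff.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of 10,000 error events; "severity" is an abstract
# score of how far each event propagates (higher = closer to the system level).
severity = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

mean, std = severity.mean(), severity.std()
print(f"reported summary: mean = {mean:.2f}, std = {std:.2f}")
print(f"mean + 2*std = {mean + 2*std:.2f}  (the usual ~95% coverage claim)")

# What actually matters: the outliers in the tail that clear the "cliff".
cliff = 10.0  # hypothetical threshold beyond which the event wrecks the system
outliers = severity[severity > cliff]
print(f"{outliers.size} events ({100 * outliers.size / severity.size:.2f}%) "
      f"exceed the cliff; these are the ones mitigation must be designed for")
```

With this distribution, mean plus two sigma lands around 6, comfortably below the cliff at 10, yet roughly 1% of the events still exceed it.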
The approach of following failure propagation recognizes that failures can have many different symptoms, and it is not accurate to lump all symptoms into the same bucket. When these failures propagate to a higher level, the subsystem or the system, they can either be benign or create system failures. Obviously the circuit designer does not have the visibility and information to judge the criticality of a system failure, nor can they analyze the propagation of errors beyond the chip into the system. So they simply do not know how to classify failures, because it is seldom clearly explained by the system engineers.
What are we missing here?
We're missing a methodology to track a failure across each and every layer of the value chain. Let's talk about this later.
Another drawback of horizontal market segmentation stems from the fact that where there is unknown, there is risk, and against perceived risk engineers increase design margins. You can easily see how margins can be assigned inadequately and pile up at each layer of the value chain. Here is another example of how margin assignment gets distorted:
When a device maker asks their chip vendor for the failure rate of a chip, which number does the vendor provide? Well, most of the time it's their best performance! Of course, it is human nature to want to impress your customers, even if the result does not quite correspond to the actual hardware configuration or application.
Reliability testing can be expensive, so when results exist, they are for one version or configuration of the device (hardware plus firmware). Needless to say, those results are reused as much as possible across many versions of the chip. At that point, either the system designer uses the data as communicated and creates a dangerous situation for the application by underestimating the risk (in reliability, you want to know the worst case, not the best performance), or on the contrary they err towards conservatism and add margins on top of the result to play it safe. Imperfect testing also reports only one number for the whole bucket (covering every type of error), which leads to maximizing margins for failures that should not be a concern, when only one type of critical failure should be addressed separately.
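As a back-of-the-envelope illustration of how this piles up (all factors below are hypothetical), consider each actor in the chain independently adding a 2x safety factor because the number they receive is a single undifferentiated bucket:

```python
# Hypothetical safety factors added independently at each layer of the value
# chain, because each actor only receives one lumped "error" number from below.
margins = {
    "chip vendor reports best-case data, integrator compensates": 2.0,
    "module maker covers unknown firmware configurations":        2.0,
    "system house covers the single bucket of all error types":   2.0,
}

cumulative = 1.0
for reason, factor in margins.items():
    cumulative *= factor
    print(f"x{factor:.1f} ({reason}) -> cumulative x{cumulative:.1f}")

# Three independent 2x margins end up ~8x more conservative than needed for
# the benign failure types, while the one critical type may still be
# under-specified because it was never separated out of the bucket.
```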
In an ideal world, specifications and requirements would be determined according to the impact of an error propagating to the system level, or to the end customer's level. This approach already exists in some cases where major system houses set their per-component specifications according to the system impact [1] (Cisco's paper).
Reliability analysis can be customized not only for each product and application, but for each category of fault. With the goal of avoiding over-engineering due to margin pile-up while still catching the outlier cases that will wreck the application, reliability analysis should be implemented according to the seriousness of the symptoms.
We present here a way of framing the reliability situation that would help solve the overdesign problem for high-reliability applications.
We define three classes of reliability targets: LF (Low Failure), ULF (Ultra Low Failure) and XLF (Extremely Low Failure). As proposed in the example table below, these three categories apply to the failure rate of the whole device, according to the type of application and specification. They can also be linked to existing specifications or reliability targets.
Pushing the model even further, we can define these three categories per error type, according to the consequences of each error. The same device will then have different FIT requirements depending on the type of failure, and especially on its symptoms at the system level.
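The table itself is not reproduced here, but a minimal sketch of the idea could look like the following, where the FIT targets and the mapping from failure type to class are purely illustrative placeholders:

```python
# Hypothetical FIT targets (Failures In Time, failures per 1e9 device-hours)
# for the three classes; the numbers are placeholders, not the table's values.
CLASS_FIT_TARGET = {
    "LF":  1000,   # Low Failure
    "ULF":  100,   # Ultra Low Failure
    "XLF":   10,   # Extremely Low Failure
}

# The same device can carry a different class per failure type, depending on
# the symptom that failure produces at system level (also hypothetical).
FAILURE_TYPE_CLASS = {
    "silent data corruption":       "XLF",  # wrong output, undetected
    "detected uncorrectable error": "ULF",  # system can react, e.g. reset
    "corrected error":              "LF",   # no visible system effect
}

for failure_type, cls in FAILURE_TYPE_CLASS.items():
    print(f"{failure_type:>28}: class {cls}, target <= {CLASS_FIT_TARGET[cls]} FIT")
```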
Let's take as an example a car as the system, the engine controller as the subsystem, and an MCU inside the controller as the component (or device). Failures such as radiation-induced Single Event Upsets occur at different locations inside the chip. Some of these errors will propagate to the output of the chip; most will not.
We can define a criticality level for each type of error, the same way we did previously for the whole chip (hence the color codes on the different paths: green, yellow, red). For example, SDC (Silent Data Corruption) may have a much tighter FIT requirement than DUE (Detectable Uncorrectable Error), because the latter can be detected and recovered from at a higher level in the system.
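A minimal sketch of that per-symptom classification (the symptom names and the green/yellow/red mapping below are assumptions for illustration, not the actual analysis of any MCU):

```python
from enum import Enum

class Criticality(Enum):
    GREEN  = "benign: masked or corrected on chip"
    YELLOW = "DUE: detected, can be handled higher up (e.g. reset, retry)"
    RED    = "SDC: propagates undetected, wrong data reaches the system"

# Hypothetical mapping from the observable symptom of an SEU at the MCU
# boundary to the criticality of the path it takes toward the controller/car.
SYMPTOM_TO_CRITICALITY = {
    "masked":                 Criticality.GREEN,
    "corrected_by_ecc":       Criticality.GREEN,
    "detected_uncorrectable": Criticality.YELLOW,
    "watchdog_reset":         Criticality.YELLOW,
    "silent_data_corruption": Criticality.RED,
}

def classify_seu(symptom: str) -> Criticality:
    """Return the criticality class of an SEU given its observed symptom."""
    try:
        return SYMPTOM_TO_CRITICALITY[symptom]
    except KeyError:
        raise ValueError(f"unclassified symptom: {symptom}") from None

if __name__ == "__main__":
    for s in ("corrected_by_ecc", "detected_uncorrectable", "silent_data_corruption"):
        c = classify_seu(s)
        print(f"{s:>24} -> {c.name}: {c.value}")
```

Each criticality class would then be checked against its own FIT budget, rather than lumping all SEU outcomes into a single number.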
How do we solve the problem?
How much can be gained with this approach compared to more traditional approaches? That's the topic of my next blog.
Until then, what do you think about this approach? Let me know.
Ref [1]: