Thursday, October 23, 2014

System reliability in Electronics: new point of view

New model to address Electronic System Reliability


Everyone agrees that the reliability of their "things" is very important... at least qualitatively. But how do we define reliability quantitatively? What does it mean to be 10% more reliable, or 20% less reliable? As in many technical and non-technical fields, until a metric is defined to measure a quantity or a process, it is not possible to act upon it.

Our lifestyles, the way we work, communicate, move around and have fun, depend heavily on electronic systems. In this blog post I'd like to present some thoughts on how to consider the reliability of these electronic systems (that's what I've been doing for a living for the past several years) that could help system architects take a more systematic approach to the issue.

I am considering here two main risk scenarios.
- I call the first one "Margin call", and I believe it is the one mostly used by reliability professionals.
- The second one is a nonlinear effect which, if we think about it, represents well what happens when failure strikes: it is the "Edge of the Cliff" scenario.


In “Margin call”, a drift in parameter values due to ageing or other semi-predictable effects, or an unlucky combination of corner effects, results in a system's performance no longer meeting its functional specifications. This can be managed by degrading the system's performance (lowering processor speed from 2GHz to 1.8GHz, for example, or accepting higher power consumption). It might not lead to complete failure: the overall system might still be working, albeit with degraded performance. The risk is linked to how wide the Gaussian distributions of all the operating performance parameters are.

The “Edge of the Cliff” effect can be a consequence of letting Margin call situations drift without corrective action. It can also be a case of Soft Error, where the location, timing and effects of the event are unpredictable. Whatever the existing performance margins of individual subsystems and components, catastrophic failure can happen.


System reliability engineers seem to handle the "Margin Call" scenario through ageing models, statistical analyses and predictions. There are several tools to do this (Monte Carlo, covariance analysis, …), but most system reliability engineers actually use Excel as the platform to compute these data.
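To make the Monte Carlo idea concrete, here is a minimal sketch in Python of a "Margin call" estimate. Everything in it is an assumption made for illustration: the function names, the single Gaussian parameter (maximum clock frequency), the constant ageing drift and all the numbers are hypothetical, not data from a real component.

```python
import random
from statistics import NormalDist


def margin_call_probability(nominal_ghz, sigma_ghz, drift_ghz, spec_ghz,
                            trials=100_000, seed=0):
    """Monte Carlo estimate of the fraction of units that fall below spec.

    Assumption: each unit's maximum frequency is Gaussian around the nominal
    value, and ageing shifts every unit down by the same drift.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    below_spec = 0
    for _ in range(trials):
        aged_fmax = rng.gauss(nominal_ghz, sigma_ghz) - drift_ghz
        if aged_fmax < spec_ghz:
            below_spec += 1
    return below_spec / trials


def analytic_probability(nominal_ghz, sigma_ghz, drift_ghz, spec_ghz):
    """Closed-form check: Gaussian CDF of the aged distribution at the spec."""
    aged = NormalDist(mu=nominal_ghz - drift_ghz, sigma=sigma_ghz)
    return aged.cdf(spec_ghz)


# Hypothetical part: 2.2GHz nominal, 0.08GHz spread, 2.0GHz specification.
for drift in (0.0, 0.05, 0.10):
    simulated = margin_call_probability(2.2, 0.08, drift, 2.0)
    exact = analytic_probability(2.2, 0.08, drift, 2.0)
    print(f"drift={drift:.2f}GHz  simulated={simulated:.4f}  analytic={exact:.4f}")
```

With a single Gaussian parameter the closed form is enough; the simulation only starts to pay off when several drifting parameters interact, which is presumably where the spreadsheet approach runs out of steam.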

Dealing with the “Edge of the Cliff” scenario consists in chasing the outliers of the distribution of every possible combination of corners, including analyzing how failures propagate within the system. Once we can describe what these cases are, we can set up design margins to be as far away from the Cliff’s edge as we need.
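As a rough illustration of chasing the worst combination of corners, the sketch below enumerates every corner of a toy system and ranks them by remaining margin. The corner values, the `max_frequency_ghz` model and its coefficients are all hypothetical; a real analysis would plug in a simulator or measured data, and would also have to model failure propagation, which this sketch does not.

```python
from itertools import product

# Hypothetical corners for a toy system (assumed values, not real data).
CORNERS = {
    "voltage_v": [0.9, 1.0, 1.1],
    "temperature_c": [-40, 25, 125],
    "process": ["slow", "typical", "fast"],
}
PROCESS_FACTOR = {"slow": 0.9, "typical": 1.0, "fast": 1.1}


def max_frequency_ghz(voltage_v, temperature_c, process):
    """Toy performance model: faster with voltage, slower when hot."""
    thermal_derating = 1 - 0.0005 * (temperature_c - 25)
    return 2.2 * PROCESS_FACTOR[process] * voltage_v * thermal_derating


def sweep_corners(spec_ghz):
    """Return (margin, corner) pairs for every combination, worst first."""
    names = list(CORNERS)
    results = []
    for values in product(*(CORNERS[name] for name in names)):
        corner = dict(zip(names, values))
        margin = max_frequency_ghz(**corner) - spec_ghz
        results.append((margin, corner))
    results.sort(key=lambda item: item[0])  # smallest margin first
    return results


worst_margin, worst_corner = sweep_corners(spec_ghz=1.6)[0]
print(f"closest to the edge: {worst_corner}, margin {worst_margin:.3f}GHz")
over_the_edge = [c for m, c in sweep_corners(spec_ghz=1.8) if m < 0]
print(f"{len(over_the_edge)} corners fall over the cliff at a 1.8GHz spec")
```

Exhaustive enumeration only works while the number of corners stays small; with many parameters the combinations explode, and some form of sampling or worst-case search would be needed instead.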

So the point is to find where the edge of the cliff is... and, by the way, the depth of the precipice is also an important parameter. The perception is that the deeper the precipice (the more serious the consequence of failure), the more distance (margin) from the cliff's edge is needed.


But in both cases the key element for system engineers is being able to trust the input data from suppliers. These data can be biased, incorrect or incomplete for many reasons. The main reason is the competitive environment and the contractual framework that govern the relationship between parties and bias their communication:
When a customer asks vendors to provide the reliability performance of their components, guess which data they provide?

You got it: the best ones! 

Think about it: is it in the best interest of the system reliability architect? Obviously not; it is actually a dangerous practice. This system architect might well find himself or herself like the person in red in the picture above: sitting at the edge of a huge cliff, without even knowing it!

How to solve the issue? After all, it is in engineers' nature to always show their best achievements and results (and hide the ones they're not so proud of).

Let me know your thoughts, but my guess is that the solution involves building solid trust among the partners in the value chain. This can work only with effective communication, opening the kimono, and maybe a specific platform for data exchange that openly shows system specifications and the exhaustive set of reliability data, good AND bad, from the component providers.


Oct 22nd Update...
I found this picture (credit: Jeff Pang. Yosemite Highlining). This might be your situation, and you don't even realize it!