Friday, June 8, 2012

DAC 2012: Interview with EE Times

DAC was quite busy this year at the Moscone Center in San Francisco.
It was a good opportunity to gauge market demand for soft error solutions, or at least the interest of different industries in this reliability problem. While many are aware of the problem, more than ever see it as a concern and are trying to be proactive about it. A summary of the markets concerned is shown in this snapshot, taken from the graphics on our booth:
We were also interviewed by EE Times online TV. You can watch the interview here.

Tuesday, May 1, 2012

Qantas Flight 72: Single Event Effects in the ATSB Final Report

The Australian Transport Safety Bureau (ATSB) released on December 19, 2011 the final report of its investigation into two sudden, uncommanded nose-dive events during a Qantas Airbus A330 flight from Singapore to Perth in October 2008 (Qantas Flight 72), which resulted in an accident. The aircraft landed at a nearby airport after the upsets, which left at least 110 passengers and crew members injured and caused some damage to the cabin interior.

See section 3.6.6 (page 143) and Appendix H for the discussion of Single Event Effects.
The report is inconclusive about the root cause of the incident. The failure originated in an avionics unit called the LTN-101 ADIRU (Air Data Inertial Reference Unit).
This unit sent incorrect data for all of the flight parameters to the other avionics equipment, eventually producing false angle-of-attack information that misled the flight control computer into commanding an abrupt nose-down maneuver.
The report suggests that the wrong signals probably came from the CPU (Intel 80960MC) inside the ADIRU. Other chips interacting with the CPU, and therefore also capable of sending wrong signals, are an ASIC from AMIS, wait-state RAM from Austin Semiconductor (64K x 4) and RAM from Mosaic Semiconductor (128K x 8).
While skimming through the report, and especially the section about SEE, I had a few thoughts:
The estimated SEU failure rate of the equipment is 1.96e-5 per hour, or 19,600 FIT (note: none of the memories were protected by ECC or EDAC). At an altitude of 37,000 ft the neutron acceleration factor compared to the ground reference (NYC) is 83x (report data), so the equivalent FIT at ground level should be about 236 FIT. The order of magnitude seems about right, even though I would like more data about the process node and the total size of the embedded memory. This FIT rate is just an estimate (from theory, not from testing) and seems to take only memory SBU (Single Bit Upsets) into account.
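As a quick sanity check of the arithmetic above (the 1.96e-5/hour rate and the 83x altitude factor are the report's figures; the rest is just unit conversion, sketched here in Python):

```python
# Back-of-the-envelope check of the SEU rate quoted in the ATSB report.
# 1 FIT = 1 failure per 1e9 device-hours.

failure_rate_per_hour = 1.96e-5                  # estimated SEU rate of the ADIRU (from the report)
fit_at_altitude = failure_rate_per_hour * 1e9
print(f"FIT at 37,000 ft: {fit_at_altitude:,.0f}")           # ~19,600 FIT

altitude_factor = 83                             # neutron flux at altitude vs. ground (NYC), from the report
fit_at_ground = fit_at_altitude / altitude_factor
print(f"Equivalent ground-level FIT: {fit_at_ground:,.0f}")  # ~236 FIT
```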
The investigators couldn't reproduce the symptoms through testing. They focused mainly on neutron testing at 14 MeV, presumably because it was a source they could access easily. A wider neutron energy range, up to hundreds of MeV (like the white neutron spectra at Los Alamos, TSL or TRIUMF), would perhaps have been more appropriate, especially to create MCU (Multi-Cell Upsets). The report states that the MCU/SBU ratio is about 1%, so the investigators didn't dig further. But this ratio depends on the process node: on the latest technologies (40 nm, 28 nm) it can reach 40% on SRAM.
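To illustrate why this ratio matters, here is a minimal sketch, assuming a memory protected by a single-error-correcting (SEC-DED) ECC with no physical bit interleaving; the raw upset rate below is a placeholder for illustration, not a figure from the report:

```python
# Illustrative only: how the MCU/SBU ratio changes the share of upsets that a
# single-error-correcting (SEC-DED) ECC could not fix if several flipped bits
# land in the same word (i.e. no physical bit interleaving).

raw_upset_fit = 236.0   # placeholder total memory upset rate, in FIT

for mcu_ratio in (0.01, 0.40):   # ~1% (report, older node) vs. up to ~40% (40/28 nm SRAM)
    mcu_fit = raw_upset_fit * mcu_ratio
    sbu_fit = raw_upset_fit - mcu_fit
    print(f"MCU ratio {mcu_ratio:.0%}: "
          f"{sbu_fit:.0f} FIT of correctable single-bit upsets, "
          f"{mcu_fit:.0f} FIT of potentially uncorrectable multi-cell upsets")
```

In other words, a contribution that is negligible on an older node can dominate the uncorrectable error budget on a modern one.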
The components seem to have been manufactured on older process nodes. In that case, did the investigators check the effect of thermal neutrons (Boron-10 was used in older technologies, notably in BPSG layers)? Or of alpha-particle contamination of the package?
I believe this report needs a few more details on the issue, and a little more investigation to try to be more conclusive… Any thoughts or comments?

Thursday, April 12, 2012

Reliability of Cloud Computing

Every second, the equivalent of 63 billion CDs of data transits through the world's internet (source: Cisco). That's 1.5 ZB per year (1 ZB = 2^70 B!). As of December 31, 2011, 2.3 billion people were using the internet, a 5.3x increase since 2000 (source: internetworldstats.com). As lifestyles in almost every country on every continent move towards ubiquitous mobility, demand for remote mass storage and cloud computing capacity is rapidly increasing. These numbers are mind-boggling, and they leave us to imagine the impact of a failure of this infrastructure leading to service disruption: we are not talking about thousands, nor even millions, of users affected. We are talking about hundreds of millions!
Obviously, cloud service architectures involve heavy redundancy, mirroring of servers across different geographical locations, disaster recovery procedures… Still, isn't there some single point of failure somewhere? While it is a well-accepted fact that software can suffer from bugs, viruses, worms… what about hardware? When firewalls, watchdogs and other software safeguards are commonly put in place, aren't we less inclined to accept hardware failure? Even trickier: what about hardware-generated data corruption? The hardware shows no sign of failure, the software is not infected by any virus… and still, something's not right.
Soft errors, even though they are only a small contributor to the overall reliability of systems, can still be the source of undetected failures that propagate through whole systems.
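As a toy illustration of that last point (the payload, the flipped bit and the CRC-based check are arbitrary examples, not a description of any real system), a single bit flip can leave data looking perfectly plausible unless some end-to-end integrity check is in place:

```python
import zlib

# Toy example of silent data corruption: one flipped bit in a stored payload
# is invisible to the application unless an end-to-end check (here a CRC32;
# in practice ECC, CRCs or stronger hashes) is performed on read-back.

payload = bytearray(b"customer_balance=1024.00")
checksum = zlib.crc32(payload)      # integrity check computed when the data was written

payload[17] ^= 0x04                 # simulate a soft error: flip a single bit in storage

print(payload.decode())             # prints "customer_balance=5024.00" -- still looks valid
print("corruption detected:", zlib.crc32(payload) != checksum)
```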
What are we doing to mitigate this problem, especially in cloud computing and data storage infrastructure?