The Australian Transport Safety Bureau (ATSB) released
on December 19th, 2011 its final report on the investigation of two
nose-dive incidents that occurred during a Qantas Airbus A330 flight from
Singapore to Perth in October 2008 (Qantas Flight 72). The plane landed at a
nearby airport after the accident, which injured at least 110
passengers and crew members and caused some damage to the inside of the aircraft.
Here’s the full report: http://www.atsb.gov.au/media/3532398/ao2008070.pdf
See section 3.6.6 (page 143) for the discussion of Single Event
Effects (SEE), and Appendix H.
The report is inconclusive about the root cause of the
incident. The problem originated in a piece of avionics equipment, the LTN-101 ADIRU
(Air Data Inertial Reference Unit).
This unit sent incorrect data for all the flight
parameters to the other avionics equipment, eventually
producing false angle-of-attack information that misled the central computer,
which reacted with a quick nose-down maneuver.
The report mentions that the wrong signal probably came from
the CPU (Intel 80960MC) inside the ADIRU. Other chips that interact with the CPU,
and could therefore have sent wrong signals, are an ASIC from AMIs, wait-state
RAM from Austin Semiconductor (64k x 4) and RAM from Mosaic Semiconductor
(128k x 8).
While skimming through the report, and especially the section
about SEE, I had a few thoughts:
The estimated SEU failure rate of the equipment is 1.96e-5
per hour, or 19,600 FIT, where 1 FIT is one failure per billion device-hours
(note: none of the memories were protected by ECC or
EDAC). At an altitude of 37,000 ft the neutron acceleration factor compared to
the ground reference (NYC) is 83x (report data), so the equivalent rate at
ground level should be about 236 FIT. The order of magnitude seems about right, even
though I’d like to have more data about the process node and the total size of embedded
memory. This FIT rate is just an estimate (from theory, not from test) and seems
to take only memory SBU (Single Bit Upsets) into account.
The investigators couldn’t reproduce the symptoms through
testing. They focused mainly on neutron testing at 14 MeV. I imagine this is
because it was a source that they could access easily. Maybe a wider neutron
energy range, up to hundreds of MeV (like the white neutron spectra at Los Alamos, TSL or
TRIUMF), would have been more appropriate, especially to create MCU (Multi-Cell
Upsets). The report states that the MCU/SBU ratio is about 1%, so they didn’t
investigate further. This depends on the process node! At the latest technologies
(40 nm, 28 nm) this ratio can be up to 40% on SRAM.
The components seem to have been manufactured on older
process nodes. But given that, did they check the effect of thermal neutrons (boron-10
was used in older technologies)? Or of alpha-particle contamination of the
package?
I believe that this report needs a little more detail on
the issue and a little more investigation to be more conclusive. Any
thoughts and comments?
In section 3.6.6 of the report, a summary is given under "In-service SEE History", and it says:
"One of these occasions involved a NAV IR 1 FAULT message on ADIRU 4122 on 18 July 2008. The BITE data showed that a checksum fault had occurred (that is, the BITE detected that the copy of operational software stored in read-only memory was different to the version loaded into RAM when the ADIRU was in operation). The ADIRU manufacturer reported that they had records of about 100 similar events on LTN-101 units since 2000, and that similar errors had resulted during the unit testing in 2005.
A high proportion of the 116 events were soft faults that resulted in the ADIRU shutting down. Insufficient information was available for most of these events to determine the origin of the faults."
I would say that is more than enough evidence for the cause.
Good point, Eigenix. Maybe further investigation through testing and simulation would have reduced the uncertainty of the conclusion.