SEECA - Section 6

SEECA

Single Event Effect Criticality Analysis

Sponsored by NASA Headquarters/ Code QW

February 15, 1996

for more information, contact Kenneth A. LaBel

Introduction
1. The SEE Problem
2. Functional Analysis and Criticality
3. Ionizing Radiation Environment Concerns
4. Effects in Electronic Devices and SEE Rates
5. SEU Propagation Analysis: System Level Effects
6. SEE Mitigation: Methods of Reducing SEE Impacts
7. Managing SEEs: System Level Planning
8. SEE Criticality Assessment Case Studies

Section 6


SEE Mitigation: Methods of Reducing SEE Impacts

Kenneth A. LaBel, NASA Goddard Space Flight Center

6.1 Introduction

For simplicity's sake, it is convenient to classify system level SEE effects into two general categories: those that affect data responses of a device, and those that affect control of a device or system. Whereas there is some overlap between the two (an obvious example being a bit flip in a memory device that contains executable code for a processor), we may consider data errors to be those that occur in memory structures or data streams and control errors to be in other hardware such as microprocessors, power devices, or FPGAs.

All of the potential SEE mitigation methods may require that either additional hardware or software be added to the system design. The complexity and, in many cases, the increase in system overhead caused by the addition(s) scale roughly linearly with the power of the mitigation scheme.

The most cost-efficient approach to meeting an SEE requirement may be an appropriate combination of SEE-hard devices and other mitigation. The cost, power, volume, performance, and availability of radiation-hardened devices often prohibit their use. Hardware or software design may serve as effective mitigation, but design complexity may present a problem. A combination of the two may be the most effective and efficient option.

6.2 Sample System Level Mitigation Techniques and Examples

6.2.1 Classification of System Level SEEs by Device Type

Much as we partition SEEs into two arenas, we may divide devices into two basic categories: those that are memory or data-related devices such as RAMs or ICs that are used in communication links or data streams, and those that are control-related devices such as a microprocessor, logic IC, or power controller. That is not to say that there is no overlap between the two categories. For example, an error could occur in the cache region of a microprocessor and cause a data error, or a data SEU (bit flip) might occur in a memory device that contains an executable program potentially causing a control SEU.

6.2.2 Mitigation of Memories and Data-Related Devices

The simplest method of mitigating errors in a memory or data stream is to utilize parity checks. This method counts the number of logic one states (or "ones") occurring in a data path (e.g., an 8-bit byte or a 16-bit word) [1]. Parity, usually a single bit added to the end of a data structure, states whether an odd or even number of ones was in that structure. This method detects an error if an odd number of bits are in error, but if an even number of errors occurs, the parity is still correct (i.e., the parity is the same whether 0 or 2 errors occur). Additionally, this is a "detect only" method of mitigation and does not attempt to correct the error that occurs.
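
As a minimal sketch of the parity behavior described above, the following Python fragment computes an even-parity bit for a byte and shows that one flipped bit is caught while two flipped bits pass unnoticed. The function names and the sample byte are illustrative assumptions, not part of any flight implementation.

```python
def parity_bit(word: int, even: bool = True) -> int:
    """Return the parity bit that makes the total count of ones even (or odd)."""
    ones = bin(word).count("1")
    bit = ones & 1              # 1 if the data holds an odd number of ones
    return bit if even else bit ^ 1


def parity_ok(word: int, stored_parity: int, even: bool = True) -> bool:
    """True when the recomputed parity matches the stored parity bit."""
    return parity_bit(word, even) == stored_parity


data = 0b10110010                 # sample byte holding four ones
p = parity_bit(data)              # parity stored alongside the data
single_flip = data ^ 0b00000100   # one upset bit
double_flip = data ^ 0b00010100   # two upset bits

print(parity_ok(data, p))         # clean word passes
print(parity_ok(single_flip, p))  # single error fails the check
print(parity_ok(double_flip, p))  # double error still passes the check
```

Because the check only reports a mismatch, any recovery (a re-read, a retransmission, a switch to a redundant copy) has to come from elsewhere in the system.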

Another common error detection only method is called cyclic redundancy check (CRC) coding [2]. This scheme is based on performing modulo-2 arithmetic operations on a given data stream, then interpreting the result as a polynomial. The N data bits are treated as an (N-1)-order polynomial. When encoding occurs, the data message is modulo-2 divided by the generating polynomial. The remainder of this operation then becomes the CRC character that is appended to the data structure. For decoding, the new bit structure, which includes the data and CRC bits, is again divided by the generating polynomial. If the new remainder is zero, no detectable errors were observed. A commonly used CRC code, especially for mass storage such as tape recorders, is the CRC-16 code, which leaves a 16-bit remainder.
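
The encode and decode steps above can be sketched as a bit-serial modulo-2 division. The generator polynomial used here (x^16 + x^15 + x^2 + 1, written 0x8005) is an assumption, since the text does not name one; the zero starting register and the function names are likewise illustrative choices.

```python
POLY = 0x8005  # x^16 + x^15 + x^2 + 1; assumed generator, not specified in the text


def crc16(message: bytes, poly: int = POLY) -> int:
    """Remainder of (message * x^16) divided modulo 2 by the generator polynomial."""
    reg = 0
    for byte in message:
        reg ^= byte << 8                  # bring the next 8 message bits into the top
        for _ in range(8):
            if reg & 0x8000:              # leading term set: subtract (XOR) the generator
                reg = ((reg << 1) ^ poly) & 0xFFFF
            else:
                reg = (reg << 1) & 0xFFFF
    return reg


def encode(message: bytes) -> bytes:
    """Append the 16-bit remainder to the message."""
    return message + crc16(message).to_bytes(2, "big")


def check(block: bytes) -> bool:
    """A zero remainder over data plus CRC means no detectable error."""
    return crc16(block) == 0


block = encode(b"SEECA")
corrupted = bytes([block[0] ^ 0x04]) + block[1:]   # flip one bit of the first byte

print(check(block))       # undisturbed block divides evenly
print(check(corrupted))   # the flipped bit leaves a nonzero remainder
```

Like parity, this only detects; the sketch makes no attempt to locate or repair the damaged bit.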

Hamming code is a simple block error encoding (i.e., an entire block of data is encoded with a check code) that will detect the position of a single error and the existence of more than one error in a data structure [1]. Hamming strategy essentially states that if there are Q check bits generated using a parity-check matrix, then there is a syndrome represented by the Q-digit word that can describe the position of a single error. This is seen simply, for example, by having a syndrome (s) with s=000 being the no error condition in a single byte, s=001 being an error in bit 1 of the byte, and so on. By determining the position of the error, it is possible to correct this error. Most designers describe this method as "single bit correct, double bit detect". This EDAC scheme is common among current solid-state recorders flying in space [for example, 3-5]. When a system performs this EDAC procedure, it is called scrubbing (i.e., scrubbing of errors from clean or good data). An example would be an 80-bit wide memory bus having a 72-bit data path and 8 bits of Hamming code. This coding method is recommended for systems with low probabilities of multiple errors in a single data structure (e.g., use only with a single bit error condition in a byte-wide data field).
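
To illustrate how a syndrome points at the failed bit, here is a minimal sketch of a small extended Hamming code: 4 data bits, 3 check bits, and one overall parity bit. This is deliberately much smaller than the 72-bit data path mentioned above, and the bit layout and function names are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def hamming_encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit codeword (index 0 holds the overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d   # data occupy the non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]    # check over positions with index bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]    # check over positions with index bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]    # check over positions with index bit 2 set
    code[0] = sum(code[1:]) & 1              # overall parity enables double-error detection
    return code


def hamming_decode(code: List[int]) -> Tuple[Optional[int], str]:
    """Return (nibble, status) where status is 'ok', 'corrected', or 'double'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                  # XOR of set positions gives the error position
    overall = sum(code) & 1                  # 0 while the total parity is still even
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                  # single error; syndrome 0 means the parity bit
        status = "corrected"
    else:
        return None, "double"                # nonzero syndrome with even parity: two errors
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = hamming_encode(0b1011)
one_error = list(word)
one_error[5] ^= 1                            # single upset: located and corrected
two_errors = list(one_error)
two_errors[2] ^= 1                           # second upset: detected, not corrected

print(hamming_decode(word))
print(hamming_decode(one_error))
print(hamming_decode(two_errors))
```

A scrubbing task would run the decode step over each stored word in turn and write back any word reported as corrected.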

Other block error codes, while beyond the scope of this paper in terms of operational description, provide more powerful error correcting codes (ECCs). Among these, Reed-Solomon (R-S) coding is rapidly becoming widespread in its usage [6]. The R-S code is able to detect and correct multiple and consecutive errors in a data structure. An example [7] is known as (255,223). This translates to a 255 byte block having 223 bytes of data with 32 bytes of overhead at the end of the message. This particular R-S scheme is capable of correcting up to 16 consecutive bytes in error. This R-S encoding scheme is available in a single IC as designed by NASA VLSI Design Center [7]. A modified R-S scrubbing for a SSR has been performed in-flight by software tasks as well [5].
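
The (255,223) figures quoted above follow from the standard Reed-Solomon relation that n - k check symbols can correct up to (n - k)/2 symbols in error. The short sketch below only reproduces that arithmetic; it is not an encoder, and the function name and dictionary keys are illustrative.

```python
def rs_parameters(n: int, k: int) -> dict:
    """Derive overhead and correction capability for an RS(n, k) code over bytes."""
    check = n - k
    return {
        "check_bytes": check,               # overhead appended to each block
        "correctable_bytes": check // 2,    # maximum bytes in error that can be corrected
        "overhead_fraction": check / n,     # share of each block spent on check bytes
    }


params = rs_parameters(255, 223)
print(params)
```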

Convolutional encoding [8], again outside the scope of operational description, is able to detect and correct multiple bit errors, but differs from block coding by interleaving the overhead or check bits into the actual data stream rather than being grouped into separate words at the end of the data structure. This style of encoding is typically considered for usage in communication systems and provides good immunity for mitigating isolated burst noise.

System level protocol methods are best understood by illustration. The SEDS MIL-STD-1773 fiber optic data bus has been successfully flying since July 1992 [9]. This system utilizes, among its error control features, two methods of detection: parity checks and detection of a non-valid Manchester encoding of data. This military standard has a system level protocol option of retransmitting or retrying a bus transaction up to three times if the error detection controls are triggered. Thus, the error detection schemes are via normal methods (parity or non-valid signalling), while the error correction is via retransmission.
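
The detect-then-retransmit pattern can be sketched as a bounded retry loop. Everything here is hypothetical: `send` stands in for one bus transaction and returns `None` when parity or Manchester validity checks reject the reply, and the sketch assumes "up to three times" means three retries after the initial attempt.

```python
from typing import Callable, Optional, Tuple

MAX_RETRIES = 3  # retry limit cited in the text; counting convention is an assumption


def bus_transaction(send: Callable[[], Optional[bytes]],
                    max_retries: int = MAX_RETRIES) -> Tuple[Optional[bytes], int]:
    """Attempt a transaction, retrying whenever error detection rejects the reply.

    Returns (reply, attempts_used); reply is None if every attempt failed.
    """
    attempts = 0
    for _ in range(1 + max_retries):      # one original attempt plus the retries
        attempts += 1
        reply = send()
        if reply is not None:
            return reply, attempts
    return None, attempts


# Simulated link: two rejected replies, then a good one.
replies = iter([None, None, b"\x12\x34"])
result = bus_transaction(lambda: next(replies))
print(result)
```

When the loop gives up, the caller still has to decide what to do next, for example reporting the failure in telemetry.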

Retransmission of data on a communication link may be autonomously performed as in the example above or may be accomplished via ground intervention. For example, if data collected in a SSR shows an unacceptable BER during a "pass" or downlink transmission to a ground station, the station may then issue a command to the spacecraft requesting retransmission of all or a selected portion of that data.

All of the above methods provide ways of reducing the effective BER of data storage areas such as SSRs, communication paths, or data interconnects. Table 6.1 summarizes sample EDAC methods for memory or data devices and systems.

Table 6.1 Sample EDAC Methods for Memory or Data Devices and Systems
EDAC Method              EDAC Capability
Parity                   Single bit error detect
CRC Code                 Detects if any errors occurred in a given data structure
Hamming Code             Single bit correct, double bit detect
RS Code                  Corrects consecutive and multiple bytes in error
Convolutional encoding   Corrects isolated burst noise in a communication stream
Overlying protocol       Specific to each system implementation

6.2.3 Mitigation of Control-Related Devices

Whereas the above techniques are useful for data SEUs, they may also be applicable to some types of control SEUs as well (microprocessor program memory, again being an example). Other devices such as VLSI circuitry or microprocessors have more complex difficulties to be aware of. Potential hazard conditions include items such as the issuance of an incorrect spacecraft command to a subsystem or a functional interruption of the system operation. Microprocessors are among the many new devices that have "hidden" registers. These are registers that are not readily accessible external to the device (i.e., on I/O pins), but provide internal device control and whose SEUs could affect the device or system operation.

Microprocessor software typically has tasks or subroutines dubbed Health and Safety (H&S) which may provide some mitigation means directly applicable to SEE [10]. These H&S tasks may perform memory scrubbing utilizing parity or other methods on either external memory devices or registers internal to the microprocessor. The software-based mitigation methods might also use internal microprocessor timers to operate a watchdog timer (see below) or to pass H&S messages between spacecraft systems. A relevant example would be if the software provided a parity check on the stored program memory when accessing an external or internal device such as an electrically erasable programmable read only memory (EEPROM). If a parity error was detected on a program memory fetch, the software might then access (read) the memory location a second time, place the system into a spacecraft safing or safe operations mode, or read the program from a redundant EEPROM.

Watchdog timers may be implemented in hardware or software or through a combination of both. Typically, watchdogs are thought of as an "I'm okay" method of error detection. That is, a message indicating the health of a device or system is sent from one location to another. If the message is not received by the second location within a set time period, a "time out" has occurred. In this instance, the system then may provide an action to the device, box, subsystem, etc... Watchdog timers may be implemented at many levels: subsystem-to-subsystem, box-to-box, board-to-board, device-to-device, etc... Watchdogs may be active or passive. The different types are best understood by example.

Example 1 is an active watchdog. Device A has to send an "I'm okay" pulse on a once per second basis to an independent device B. B, for example, is an interrupt controller for a microprocessor system. If A fails to send this pulse within the allocated time period, device B "times out" and initiates a recovery action such as issuing a reset pulse, removing power, sending a telemetry message to the ground, placing the spacecraft into safing mode, etc... B's actions are very specific to each mission scenario and spacecraft mode of operation.

Example 2 is a passive watchdog timer. In spacecraft X's normal operating scenario, it receives uplink messages (commands, code patches, table loads, etc...) from the ground station every twelve hours. There is a timer on-board the spacecraft that times out if no uplink is received within this 12 hour (or perhaps, a 24 hour) time frame. The spacecraft then initiates an action such as a switch to a redundant antenna or uplink interface, a power cycling of the uplink interface, etc... What makes this a passive watchdog is that no specific "I'm okay" needs to be sent between peers; monitoring of normal operating conditions is sufficient.
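
The two watchdog styles in Examples 1 and 2 can be sketched with one small timer class. The class name, the method names, and the explicit `now` argument (used instead of a real clock so that the sketch is deterministic) are illustrative assumptions. In the active case a peer calls `pulse` to say "I'm okay"; in the passive case `pulse` is simply called whenever normal activity, such as an uplink, is observed.

```python
class WatchdogTimer:
    """Times out when no activity is recorded within `timeout_s` seconds."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_activity = now

    def pulse(self, now: float) -> None:
        """Record an "I'm okay" pulse (active) or observed normal activity (passive)."""
        self.last_activity = now

    def timed_out(self, now: float) -> bool:
        """True once the allowed interval has elapsed with no activity."""
        return now - self.last_activity > self.timeout_s


# Active watchdog: device A must pulse device B once per second.
active = WatchdogTimer(timeout_s=1.0)
active.pulse(now=0.9)
a_ok = active.timed_out(now=1.5)      # False: the last pulse was 0.6 s ago
a_late = active.timed_out(now=2.5)    # True: B would start its recovery action

# Passive watchdog: an uplink is expected at least every 12 hours.
TWELVE_HOURS = 12 * 3600.0
passive = WatchdogTimer(timeout_s=TWELVE_HOURS)
passive.pulse(now=3600.0)             # uplink seen one hour in
p_late = passive.timed_out(now=3600.0 + TWELVE_HOURS + 1.0)  # True: switch uplink path
```

The recovery action itself (reset, power removal, antenna switch) is left out because, as noted above, it is specific to each mission.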

Redundancy between circuits, boxes, subsystems, etc... provides a potential means of recovery from an SEE on a system level. Autonomous or ground-controlled switching from a prime system to a redundant spare provides system designers an option that may or may not fit within mission-specific spacecraft power and weight restrictions. Redundancy between boxes is relatively straightforward, therefore we present a lower system level redundancy example. The MIL-STD-1773 fiber optic data bus is a fully redundant bus with an A side and a B side. Redundancy, in this implementation, allows the system designer to automatically switch from the prime (A) side to the redundant (B) side for all transactions in case of a failed transmission on the A bus, or to retry on the B side in case of an A failure, or wait for a command to switch to B if the bus BER on the A side exceeds a specified limit, etc...

Operating two identical circuits with synchronized clocking is termed a lockstep system. One normally speaks of lockstep systems when discussing microprocessors [11]. Error detection occurs if the processor outputs do not agree, implying that a potential SEU has occurred. The system then has the option of reinitializing, safing, etc... It must be pointed out that for longer spacecraft mission time frames, lockstep conditions for commercial devices must be well thought out. In particular, the TID degradation of the commercial devices must be examined for clock skew with increasing dosage. This may potentially cause "false" triggers between two such devices if each responds to dosage even slightly differently.

Voting is a method that takes lockstep systems one step further: having three identical circuits and choosing the output that at least two agree upon. Katz, et al. [12] provide an excellent example of this methodology. They have proposed and SEU-tested a triple modular redundancy (TMR) voting scheme for FPGAs, i.e., three voting flip-flops per logical flip-flop. FPGAs, one should note, replace older LSI circuits in many systems by providing higher gate counts and device logic densities. Thus, the IC count as well as the physical space required for spacecraft electrical designs may be reduced. The TMR scheme proposed does not come without an overhead penalty; one essentially loses over two-thirds of the available FPGA gate count by implementing this method.
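
The difference between lockstep comparison and voting can be sketched on plain integers standing in for circuit outputs. This is a software illustration only, not the flip-flop level scheme of [12], and the function names are assumptions.

```python
def lockstep_agree(out_a: int, out_b: int) -> bool:
    """Lockstep: two copies can reveal a mismatch but not which copy is right."""
    return out_a == out_b


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies: each bit takes the value two or more agree on."""
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
upset = good ^ 0b00001000                    # one copy suffers a bit flip

print(lockstep_agree(good, upset))           # mismatch flags a possible SEU
print(tmr_vote(good, upset, good) == good)   # the vote masks the single upset copy
```

The vote fails only when two copies are upset in the same bit position, which is why the scheme is paired with three physically separate storage elements.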

The discussion of FPGAs brings out an interesting point: systems are becoming increasingly complex as well as integrated. Gate arrays, FPGAs, and application specific ICs (ASICs) are becoming increasingly commonplace in electrical spacecraft designs. Liu and Whitaker [13] provide one such SEU hardening scheme to provide SEU immunity in the custom IC design phase that is applicable to spacecraft designs. This method provides a logic configuration which separates the p-type and the n-type diffusion nodes within a memory circuit.

The use of "good" engineering practices for spacecraft contributes another means of SEU mitigation [14]. Items such as the utilization of redundant command structures (i.e., two commands being required to trigger an event usually with each command having a different data value or address), increased signal power margins, and other failsafe engineering techniques may aid an SEU hardening scheme.
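
A redundant command structure of the kind mentioned above can be sketched as a two-step arm and fire sequence. The command values and class name are hypothetical; the two values are chosen to differ in every bit so that a single bit flip cannot turn one command into the other.

```python
ARM_CODE = 0xA5    # hypothetical data value for the first command
FIRE_CODE = 0x5A   # hypothetical value for the second command (bitwise complement)


class TwoStepCommand:
    """Triggers only when ARM_CODE is followed directly by FIRE_CODE."""

    def __init__(self) -> None:
        self.armed = False

    def receive(self, code: int) -> bool:
        """Process one command word; return True only when the event should trigger."""
        if self.armed and code == FIRE_CODE:
            self.armed = False
            return True
        self.armed = (code == ARM_CODE)   # any other word disarms the sequence
        return False


cmd = TwoStepCommand()
print(cmd.receive(FIRE_CODE))   # a lone second command does nothing
print(cmd.receive(ARM_CODE))    # first command arms only
print(cmd.receive(FIRE_CODE))   # correct pair triggers the event
```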

These and other good engineering practices usually allow designers to be innovative and discover sufficient methods for SEU mitigation as needed. The authors would like to point out that the greatest risk to a spacecraft system and conversely, the greatest challenge to an electrical designer is having unknown device or system SEE characteristics.

6.2.4 Treatment of Destructive Conditions and Mitigation

Destructive SEE conditions may or may not be recoverable depending on the individual device's response. Hardening from the system level is difficult at best, and in most cases, not particularly effective.

This stems from several concerns. First, non-recoverable destructive events such as single event gate rupture (SEGR) or burnout (SEB) require redundant devices or systems be in place since the prime device fails when the event occurs. SEL may or may not have this same failure with each malfunction response being very device specific. Microlatch, in particular, is difficult to detect since the device's current consumption may remain within specification for normal device operation. LaBel, et al. [15] have demonstrated the use of a multiple watchdog timeout scheme as a potential mitigation. In this instance, the first level watchdog acts as an "I'm okay" within a local circuit board. If this watchdog is triggered, a reset pulse is issued to the local circuitry. If this trigger-reset scenario occurs N times continuously or fails to recover the board within X seconds, a secondary watchdog is triggered that removes power from the board. Power is restored via a ground command. This SEDS system was successfully SEL tested at BNL.
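
The escalation logic of the multiple watchdog scheme can be sketched as a small state machine. This is an illustration under assumptions, not the SEDS design: the class and method names are invented, the Nth consecutive timeout is taken as the point where power is removed, and the "X seconds" recovery window is omitted for brevity.

```python
class TwoLevelWatchdog:
    """First level resets the board; repeated timeouts escalate to power removal."""

    def __init__(self, max_resets: int) -> None:
        self.max_resets = max_resets      # the "N" in the text
        self.consecutive_resets = 0
        self.powered = True

    def on_timeout(self) -> str:
        """First-level watchdog fired; return the action taken."""
        if not self.powered:
            return "off"
        self.consecutive_resets += 1
        if self.consecutive_resets >= self.max_resets:
            self.powered = False          # second level: remove board power
            return "power_removed"
        return "reset"                    # first level: reset the local circuitry

    def on_healthy(self) -> None:
        """A valid "I'm okay" arrived, so the run of resets is broken."""
        self.consecutive_resets = 0

    def ground_power_on(self) -> None:
        """Power is restored only by ground command."""
        self.powered = True
        self.consecutive_resets = 0


wd = TwoLevelWatchdog(max_resets=3)
actions = [wd.on_timeout() for _ in range(4)]
print(actions)
```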

For individual devices, a current limiting circuit that may also cycle power is often considered. However, the failure modes of this protection circuit are sometimes worse than finding a less SEL-sensitive device (e.g., infinite loop of power cycling may occur). Hence, SEL should be treated by the designer on a case-by-case basis considering the device's SEL response, circuit design, and protection methods. Please note that multiple latchup paths are present in most circuits, each with a different current signature. This makes the designer's job difficult in specifying the required current limit.

A concern similar to microlatch exists if, for example, current limiting is performed on a card or higher integration level and not on an individual device. A single device might enter a SEL state with a current sufficient to destroy the device, but not at a high enough current level to trigger the overcurrent protection on a card or higher level. The key here is again to know the device's SEL current signatures for each of its latchup paths.

One other method of SEL protection, riskier because of its potential time lag in detection and recovery, is best demonstrated by example. An ADC has a known SEL sensitivity. The device's current consumption is sampled periodically by a control processor. If the current read exceeds a specified limit, power cycling is performed. This method may also use either telemetry data points for ground intervention or a device's specific or internal calibration parameters to be successful [16].
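
The periodic current check can be sketched as a loop over sampled readings. The function name, units, and sample values are hypothetical. The cap on the number of power cycles is an added assumption, included here only to avoid the endless power cycling failure mode noted earlier; it is not described in the text.

```python
from typing import Iterable, List


def monitor_current(samples_ma: Iterable[float], limit_ma: float,
                    max_power_cycles: int) -> List[str]:
    """Return the action taken for each periodic current reading."""
    actions = []
    cycles = 0
    for current in samples_ma:
        if current <= limit_ma:
            actions.append("nominal")
        elif cycles < max_power_cycles:
            cycles += 1
            actions.append("power_cycle")       # suspected latchup: cycle device power
        else:
            actions.append("flag_for_ground")   # stop cycling and leave it to operators
    return actions


readings = [40, 42, 95, 41, 96, 97]             # hypothetical readings in mA
log = monitor_current(readings, limit_ma=60, max_power_cycles=2)
print(log)
```

The time between samples sets how long a latched device can draw excess current before it is noticed, which is the time lag risk described above.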

6.2.5 Sample Methods of Improving Designs for SEE Performance

By changing the design of a circuit or certain circuit parameters, improved SEU performance may be gained. Marshall, et al. [17] and LaBel, et al. [18] have demonstrated several ways of improving a fiber optic link's SEU-induced BER. First is the selection of diode material (typically, III-V versus Si). The use of a III-V material results in a significantly smaller device sensitive volume. A second way to reduce BER is by the selection of the method for received signal detection (edge-triggered versus level sensitive) with a level-sensitive system being less SEU sensitive. A third scheme for BER reduction is to define a dynamic sensitive time window. This method essentially states that there are only certain time periods when the occurrence of a radiation-induced transient will have an observed effect. Lastly, by increasing the optical power margin, the BER is also reduced. These and similar techniques may apply to other designs as well.

6.2.6 Sample Methods of Realistic SEE Risks and Usage

Deciding whether an SEE in a device has a risk factor that makes a device usable in spaceflight or not is complex at best. Many factors weigh into the concern: mission environment, device test data, modes of operation, etc... Several sample system issues may clarify the types of issues that are involved.

The SEDS RPP uses separate EEPROMs for its boot and application software storage on-board the SAMPEX spacecraft [19]. These particular EEPROMs have shown a sensitivity to SEUs while being programmed, albeit not during read operations. In addition, stuck bits may occur during programming operations at LETs above Ni-58 (i.e., there is a low probability of occurrence in-flight). Since its launch in July of 1992, the application software EEPROMs have successfully been reprogrammed in-flight twice, but with certain constraints. These mission-specific constraints include: the time period for programming uses a relatively proton and heavy ion flux-free portion of the orbit, and that the boot EEPROM is not programmed during flight. Why was the risk taken? The SEDS then verifies the newly programmed data by the use of a CRC code as well as by ground station activities prior to loading the new executable software for SEDS operations. If an incorrect byte was programmed into the device, this mitigation scheme would catch it. If a stuck bit is discovered in the EEPROM, a recovery option is built into the system that provides a memory mapping around the failed location. Lastly, since the actual time window during programming when the device is susceptible to error is very small, few, if any, particles capable of causing an anomaly are seen at the device. However, it should be noted that the risk might be deemed unacceptable if continuous programming of the EEPROM was being performed throughout the mission's orbit.

The SEDS system has previously been pointed out for its use of system level error control in its fiber optic data bus as well as for the use of Hamming code EDAC on its SSR[3,9]. The SEDS system also has a multi-layer system of watchdog timers that monitor system operation [19]. The layers are as follows:

- a software task executing in the main spacecraft microprocessor that times out if a value is not passed by a second software task and that restarts the processor from a known state,

- a programmable interrupt signal from the main spacecraft microprocessor that provides a reset pulse to an external timer circuit that times out if not written to within an N second window causing a hardware reset pulse to occur to the processor,

- if multiple reset pulses occur consistently, this same external timer circuit provides a H&S message to a secondary processor box whereupon the secondary takes action,

- an "I'm okay" pulse between the prime and secondary processors that must occur once every X seconds upon which the secondary processor may remove/cycle power to the main processor or place the spacecraft in safehold until ground station intervention, and

- a multi-day timer that places the spacecraft into safehold if proper system operations have not occurred within a 24 hour period.

As one may observe, mitigation methods for the SEDS are performed on several levels: software, device, circuit/card, box, and subsystem/spacecraft. Also note the use of both active and passive watchdogs.

6.3 Summary

We have presented a sampling of information regarding SEE mitigation from the systems design level. This has included defining functional impacts of SEEs, examples of spacecraft designs, potential methods of SEE mitigation, as well as an example of realistic risks in space utilization of a sensitive EEPROM.

6.4 Acknowledgements

We would like to acknowledge the insight provided by Dr. Paul Marshall in numerous discussions prior to the drafting of this document.

6.5 References

1. A.B. Carlson, Communication Systems, New York: McGraw Hill Book Company, 1975.

2. K.L. Short, Microprocessors and Programmed Logic, Second Edition, New Jersey: Prentice Hall, Inc., 1987.

3. K.A. LaBel, S. Way, E.G. Stassinopoulos, C.M. Crabtree, J. Hengemihle, M.M. Gates, "Solid State Tape Recorders: Spaceflight SEU Data for SAMPEX and TOMS/Meteor-3", Workshop Record for the 1993 IEEE Radiation Effects Data Workshop, pp 77-84, 1993.

4. R. Harboe-Sorensen, E.J. Daly, L. Adams, "Observation and Prediction of SEU in Hitachi SRAMS in Low Altitude Polar Orbits", IEEE Trans. Nucl. Sci., vol 40, pp 1498-1504, Dec 1993.

5. C.I. Underwood, R. Ecoffet, S. Duzellier, D. Faguere, "Observations of Single-Event Upset and Multiple-Bit Upset in Non-Hardened High-Density SRAMs in the TOPEX/Poseidon Orbit", Workshop Record for the 1993 IEEE Radiation Effects Data Workshop, pp 85-92, 1993.

6. W.K. Miller, NASA Goddard Space Flight Center, Private communication, 1995.

7. R-S Encoder Data Sheet, NASA VLSI Design Center, 1994.

8. W.L. Pritchard, H.G. Suyderhoud, R.A. Nelson, Satellite Communication Systems Engineering, New Jersey: Prentice Hall, Inc., 1993.

9. K.A. LaBel, P.W. Marshall, C.J. Dale, C.M. Crabtree, E.G. Stassinopoulos, M.M. Gates, "SEDS MIL-STD-1773 Fiber Optic Data Bus: Proton Irradiation Test Results and Spaceflight SEU Data", IEEE Trans. Nucl. Sci., vol 40, pp 1638-1644, Dec 1993.

10. R.J. Whitley, NASA Goddard Space Flight Center, Private communication, 1995.

11. J.L. Kaschmitter, D.L. Shaeffer, N.J. Colella, C.L. McKnett, P.G. Coakley, "Operation of Commercial R3000 Processors in the Low Earth Orbit (LEO) Space Environment", IEEE Trans. Nucl. Sci., vol 38, pp 1415-1428, Dec 1991.

12. R. Katz, R. Barto, P. McKerracher, R. Koga, "SEU hardening of Field Programmable Gate Arrays (FPGAs) for Space Applications and Device Characterization", IEEE Trans. Nucl. Sci., vol 41, pp 2179-2186, Dec 1994.

13. M.N. Liu, S. Whitaker, "Low Power SEU Immune CMOS Memory Circuits", IEEE Trans. Nucl. Sci., vol 39, pp 1679-1684, Dec 1992.

14. Engineering Directorate Electrical Design Guidelines, NASA/GSFC, 1991.

15. K.A. LaBel, E.G. Stassinopoulos, G.J. Brucker, C.A. Stauffer, "SEU Tests of a 80386 Based Flight-Computer/Data-Handling System and Discrete PROM and EEPROM Devices, and SEL Tests of Discrete 80386, 80387, PROM, EEPROM and ASICS", Workshop Record for the 1992 IEEE Radiation Effects Data Workshop, pp 1-11, 1992.

16. S.K. Miller, Orbital Sciences Corporation, Private communication, 1994.

17. P.W. Marshall, C.J. Dale, M.A. Carts, K.A. LaBel, "Particle-Induced Bit Errors in High Performance Fiber Optic Data Links for Satellite Data Management", IEEE Trans. Nucl. Sci., vol 41, pp 1958-1965, Dec 1994.

18. K.A. LaBel, D.K. Hawkins, J.A. Cooley, C.M. Seidleck, P.W. Marshall, C.J. Dale, M.M. Gates, H.S. Kim, E.G. Stassinopoulos, "Single Event Effect Ground Test Results for a Fiber Optic Data Interconnect and Associated Electronics", IEEE Trans. Nucl. Sci., vol 41, pp 1999-2004, Dec 1994.

19. K.A. LaBel, NASA Goddard Space Flight Center, Private communication, 1995.
