SEECA - Section 5

Single Event Effect Criticality Analysis

February 15, 1996

for more information, contact Kenneth A. LaBel

Introduction
1. The SEE Problem
2. Functional Analysis and Criticality
3. Ionizing Radiation Environment Concerns
4. Effects in Electronic Devices and SEE Rates
5. SEU Propagation Analysis: System Level Effects
6. SEE Mitigation: Methods of Reducing SEE Impacts
7. Managing SEEs: System Level Planning
8. SEE Criticality Assessment Case Studies

Section 5 SEU Propagation Analysis: System Level Effects

Kenneth A. LaBel, NASA Goddard Space Flight Center

5.1 Definition

SEU propagation is the art and science of determining the effect and potential impact that the occurrence of an SEU has on the device where the SEU occurs, its associated circuitry, subsystem, system, and spacecraft. That is to say, it describes how an SEU propagates up the ladder of design integration. For example, suppose an SEU occurs in an A-to-D converter, causing a single incorrect data sample to be gathered. This "invalid" data sample may provide an incorrect data point such as a star location or a misleading temperature value.

5.2 Ground Test and Simulation of System Level or Propagated SEEs

The concept of propagated SEUs is straightforward to the typical electrical engineer. It is similar to what one might perform in a standard mathematical circuit simulation, that is, how a signal pulse, transient, or state will affect a circuit's performance either instantly or in future clock cycles.

Several groups have published work on the simulation of SEU effects and their propagation to the circuit and system level, as well as on SEE ground testing of devices in the actual circuit design used in a spacecraft system [1-12].

Newberry, et al. [1-3] have been leaders in the area of SEU propagation. In particular, they have discussed the effects of radiation-induced input/output (I/O) transients or noise spikes on system performance as well as that of VLSIC transients. In essence, the idea is that traditional bit flips in memory cells are not the only cause of SEUs at the system level; SEU-induced voltage spikes occurring in logic or I/O devices also affect the system SEU rate and effects. This work was also among the first to discuss transients and circuit-specific levels for defining SEUs (i.e., duration and amplitude constraints). For example, a 0.25V spike of 5 nanoseconds in duration may or may not be observed by the following circuit elements.
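The duration and amplitude constraints can be illustrated with a minimal sketch. It is not taken from the cited work: the names Transient, Receiver, and is_observed, and the receiver limits, are hypothetical. Only the 0.25 V, 5 ns spike comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transient:
    amplitude_v: float   # peak deviation from the nominal level, in volts
    duration_ns: float   # pulse width, in nanoseconds


@dataclass(frozen=True)
class Receiver:
    # Hypothetical circuit-specific limits; real values would come from the
    # downstream circuit's noise margin and response time.
    noise_margin_v: float
    min_pulse_width_ns: float


def is_observed(t: Transient, rx: Receiver) -> bool:
    """A transient counts as an SEU only if the following circuit can see it."""
    return (t.amplitude_v > rx.noise_margin_v
            and t.duration_ns >= rx.min_pulse_width_ns)


spike = Transient(amplitude_v=0.25, duration_ns=5.0)
slow_high_margin = Receiver(noise_margin_v=0.4, min_pulse_width_ns=10.0)
fast_low_margin = Receiver(noise_margin_v=0.2, min_pulse_width_ns=2.0)

print(is_observed(spike, slow_high_margin))  # False
print(is_observed(spike, fast_low_margin))   # True
```

The same spike is an SEU for one receiver and not for the other, which is why the SEU definition has to be circuit-specific.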

Leavy, et al. [4] have described the propagation of events inside of a bulk CMOS microprocessor with SEU-hardened clocked flip-flops. In this instance, SEU-induced transients on the clock lines were shown to be capable of causing upsets to microprocessor operation. As a side note, Leavy, et al. were able to solve this problem through a circuit redesign for their next foundry run of the microprocessor.

LaBel, et al. [5,6] have described the effects of transients in a fiber optic receiver photodiode as well as how these affect a system bit error rate (BER), both from the physical link perspective and through higher layers of network protocol. This will be described below.

SEU-induced transients in analog devices have been reported by several organizations [7,8,9]. All of these references point out two facts. First, transients in devices such as a comparator or op amp may propagate to the digital electronics in the surrounding circuitry. Depending on the specific circuit designs, these transients may only corrupt a single telemetry sample or, in a worst case scenario, cause system dysfunction or failure. The second item was pointed out by Newberry [3] as well: the definition of an analog SEU phenomenon is specific to the interface circuitry surrounding the radiation-sensitive device.

Taking this one step further, Turflinger, et al. [10,11] have extensively delved into separating SEUs for conventional analog-to-digital converters (ADCs) into several categories. The two major categories are noise and offset errors that are analogous to Gaussian and non-Gaussian errors. Neither of these errors is fatal to the device itself, but both are capable of causing erroneous telemetry and misinterpretation by or impairment of the surrounding spacecraft systems.

McCarty, et al. [12] also have explored an ADC. However, this ADC was not a conventional successive-approximation register (SAR) or flash ADC, but a complex hybrid delta-sigma averaging ADC susceptible to both noise and offset errors, as well as control errors. These control errors are capable of affecting device operation and calibration. Furthermore, they hinder system performance in a space environment.

At this point, we have emphasized the effects of transient SEUs on system performance. This is not intended to slight digital SEU effects such as bit flips. These types of SEUs may propagate, for example, from a control or data register inside of a microprocessor into operational performance of the circuit or system. A worst case example may be the false commanding of critical hardware such as a thruster or pyro.

In some instances, it is not necessary to know which particular area of a device has seen an SEU, only how well the system mitigation design will work. NASA has been among the first to fly a commercial 32-bit microprocessor in a critical space application [13]. The Small Explorer Data System (SEDS) is a spacecraft Command and Data Handling subsystem for the Solar Anomalous Magnetospheric Particle Explorer (SAMPEX) mission at Goddard Space Flight Center (GSFC). Included as a critical portion of the SEDS is the Recorder Processor Packetizer (RPP): an Intel 80386 microprocessor-based flight computer with 26.5 MBytes of solid state data storage.

One of the design features of the SEDS is its built-in fault tolerance and its ability to recover from observed errors. This is accomplished via SEDS hardware watchdog circuitry (at multiple levels: circuit, board, box, etc.) as well as software health and safety tasks. To verify the fault tolerant capabilities of the SEDS, an SEU test was performed on the RPP in which SEUs were induced on the 80386 microprocessor family [13]. The Tandem Van de Graaff accelerator at Brookhaven National Laboratory was utilized for this purpose.

To summarize the SEDS ground SEE test results, several different errors were observed, including a halting of the RPP's operation and "processor exceptions". All of the SEEs were recoverable using the planned SEDS mitigation techniques.

It should be noted that the SEDS has been performing flawlessly from the SEE mitigation perspective since its launch in July of 1992.

5.3 Propagation Analysis Methodology

In many ways, SEU propagation is similar to both traditional circuit simulation and failure modes and effects analysis (FMEA). In both instances, the goal is to determine the end effects that an error or failure has on the performance of a device, circuit, or system. To this end, we shall trace the steps an engineer may utilize in determining SEU propagation effects.

5.3.1 Device Analysis

This is the lowest level of propagation analysis included herein. Figure 5.1 illustrates this methodology.

Step 1: Is the device sensitive to SEUs?

This is relatively straightforward:
- If the answer is no, then no further analysis is required.
- If the answer is yes, then go to Step 2.

Step 2: Does the device meet mission requirements?

A device that has a known SEU sensitivity might still meet mission requirements. An example would be a device having an LET_th = 45 MeV*cm²/mg when the mission requires devices with an LET_th > 35 MeV*cm²/mg. The device is not insensitive to SEUs, but is acceptable for this particular mission.
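Steps 1 and 2 amount to a simple screening decision, sketched below. The names Screening and screen_device are hypothetical, and treating a device below the required threshold as "continue the analysis" is an assumption made for illustration; the text above only states the acceptable case.

```python
from enum import Enum
from typing import Optional


class Screening(Enum):
    NOT_SENSITIVE = "no further analysis required"
    MEETS_REQUIREMENT = "sensitive, but acceptable for this mission"
    NEEDS_ANALYSIS = "continue with the propagation analysis"  # assumed outcome


def screen_device(let_th: Optional[float], required_let_th: float) -> Screening:
    """Screen one device.

    let_th is the measured LET threshold in MeV*cm^2/mg, or None if the
    device is not sensitive to SEUs at all.
    """
    if let_th is None:
        return Screening.NOT_SENSITIVE
    if let_th > required_let_th:
        return Screening.MEETS_REQUIREMENT
    return Screening.NEEDS_ANALYSIS


# The example from the text: LET_th = 45 against a requirement of > 35.
print(screen_device(45.0, 35.0).name)  # MEETS_REQUIREMENT
print(screen_device(20.0, 35.0).name)  # NEEDS_ANALYSIS
```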

Step 3: Determine SEU sensitive device areas.

In analyzing a device, one must determine where and what types of SEUs may occur. Simple devices such as memories may have two device areas for discussion, memory cells and control logic, while complex devices such as microprocessors may have dozens of individual areas. As one would expect, the more highly integrated a device is, the more sensitive areas may be associated with it. For simplicity, we shall limit the SEUs discussed to two types: bit flips (state changes), which typically occur in memory cells or flip-flops, and transients, which occur in combinatorial logic or manifest themselves as a "noise" spike in both analog and digital IC areas. Table 5.1 illustrates several potential ICs and their associated areas. This list should not be construed as exhaustive, but simply as a sampling of device types.

Table 5.1 Sample Device Types and Sensitive Areas

Device Type          Sensitive Area                              SEU Types
Memories             Memory cells                                Bit flips
                     Control logic                               Bit flips if sequential, transients if combinatorial
Combinatorial logic  Combinatorial logic                         Transients
Sequential logic     Sequential logic                            Bit flips
FPGAs                Combinatorial logic                         Transients
                     Sequential logic                            Bit flips
Microprocessors      Registers, cache, sequential control logic  Bit flips
                     Combinatorial control logic                 Transients
ADCs, DACs           Analog portion                              Transients
                     Digital portion                             Bit flips or transients depending on design
Linear ICs           Analog area                                 Transients
Photodiodes          Photodiode                                  Transients

Step 4: Determine operational parameters

How a device is being utilized in its specific application may affect its SEU performance as well. Parameters such as access rates, operational modes, clock frequency, power supply voltage, etc. have definitive impacts not on the occurrence, but on the observed effect of an SEU. Several examples may help illustrate this.

An SRAM device used in a data storage area provides a simple example. SRAMs, again for convenience, have three operating conditions: Read, Write, and Static (Data Storage) modes. SEU ground testing may show each mode to have a different SEU sensitivity, i.e., LET_th and cell cross-section. In a typical solid state recorder (SSR) application, an SRAM is written to once between downlink operations to the ground, read once during downlink playback, and remains in static mode for the remainder of the time (typically >99%). Because not all memory cells in a device are written at the same time (i.e., they are written one byte at a time), the SEUs that have an observed effect are those that occur during a write or read operation and those that occur after a cell is written and prior to downlink. If an SEU occurs in a memory cell during the time period between downlink and the writing of that cell, the SEU is overwritten during the write operation. Hence, that particular SEU has no observed effect. This is sometimes known as a benign SEU. Additionally, actual write and read accesses take on the order of 10-200 nsec to occur. Thus, the sensitive time window, i.e., the time period when an SEU has an observed effect, is very small for these operations.
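The static-mode part of this argument can be sketched as follows. This is a simplified model under stated assumptions, not a rate calculation: each cell is written once and read once per recorder cycle, the brief read and write accesses themselves are ignored, and the function names are hypothetical.

```python
def seu_is_observed(t_seu: float, t_write: float, t_read: float) -> bool:
    """True if an upset at t_seu corrupts data that is later played back.

    All times are measured from the end of the previous downlink, in any
    consistent unit. An upset before the write is overwritten (benign); an
    upset after the read is assumed to be overwritten in the next cycle.
    """
    return t_write <= t_seu < t_read


def observed_fraction(t_write: float, t_read: float, cycle: float) -> float:
    """Fraction of a cycle in which an upset in this cell has an observed effect."""
    if not 0.0 <= t_write <= t_read <= cycle:
        raise ValueError("expected 0 <= t_write <= t_read <= cycle")
    return (t_read - t_write) / cycle


# Hypothetical 12-hour cycle: one cell is written at hour 9 and read at hour 12.
print(seu_is_observed(5.0, t_write=9.0, t_read=12.0))         # False (benign)
print(seu_is_observed(10.0, t_write=9.0, t_read=12.0))        # True
print(observed_fraction(t_write=9.0, t_read=12.0, cycle=12.0))  # 0.25
```

Under this model, a cell written late in the cycle has a much shorter sensitive window than one written just after the previous downlink.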

A second sample scenario might involve a microprocessor. As discussed previously, these types of devices are very complex and have many different areas where an SEU may occur. Some areas have obvious effects on the device performance: for example, a program counter (PC) register. If a bit flip occurs in the PC, the microprocessor program flow is disrupted. However, there may be other device areas, such as a status register or an area of the device not being utilized, where the occurrence of an SEU is benign. If, for example, the microprocessor has a programmable interval timer (PIT) built in, one must know if and how it is utilized in this specific design. If the PIT is not used, the SEU would be benign. If the PIT is utilized, one must analyze what performance effect (e.g., a different time period than expected) this has based on when the SEU occurs. Additionally, one should know the expected operating modes and area utilization to determine sensitive time windows and non-benign SEU conditions.

Other parameters may affect the device's SEU performance. These include clock frequency and power supply voltage. One should always ask the "what if" question: what if an SEU occurred at location A during time period B? Note that the probability of SEU observance is linked to the sensitive time window for the event as well as to area SEU sensitivity and the environment.

Step 5: Determine/simulate device performance

Now that we have determined the sensitive device areas and operational effects on observed SEUs, the determination of what apparent effect the SEU has on device performance must be explored. Several outcomes may transpire. These include, but are far from limited to:

- improper device operation,
- incorrect device output,
- errors in memory structures to be accessed externally,
- noise spikes on transmission lines,
- device mode changes such as going from an active to standby mode, and,
- incorrect device timing.

If one looks at this as a traditional circuit simulation, digital test vectors with errors (SEUs) could be used to determine the observed effect. At a lower level, SPICE (analog) simulations with injected transients could be utilized as well. Sample scenarios would include FPGA simulations of combinatorial and/or sequential logic or a microprocessor PIT sending out a pulse at an incorrect time. The output of this analysis is a list of potential SEUs for each device.
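As an illustration of comparing a fault-free run against a run with an injected error, the sketch below models a simple down-counting interval timer and flips one counter bit at a chosen cycle. The timer model and the name run_timer are hypothetical and far simpler than a real PIT; the point is only the golden-versus-faulty comparison.

```python
from typing import List, Optional


def run_timer(reload: int, cycles: int,
              flip_cycle: Optional[int] = None, flip_bit: int = 0) -> List[int]:
    """Simulate a down-counting timer and return the cycles on which it pulses.

    If flip_cycle is given, one counter bit is inverted at the start of that
    cycle to emulate an SEU-induced bit flip.
    """
    counter = reload
    pulses = []
    for cycle in range(cycles):
        if cycle == flip_cycle:
            counter ^= 1 << flip_bit   # injected bit flip
        if counter == 0:
            pulses.append(cycle)       # terminal count: emit a pulse
            counter = reload
        else:
            counter -= 1
    return pulses


golden = run_timer(reload=4, cycles=20)
faulty = run_timer(reload=4, cycles=20, flip_cycle=6, flip_bit=3)

print(golden)  # [4, 9, 14, 19]
print(faulty)  # [4, 17]
```

Here a single flipped bit delays the second pulse from cycle 9 to cycle 17. Repeating the run over different flip cycles and bits gives one way to build up the list of potential SEU effects for the device.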

5.3.2 Circuit Level Analysis

Circuit level analysis follows the same steps (3-5) as the device level but with the key now being the circuit operation and performance. As with device level analysis, once we know which devices have SEUs and what those SEUs may look like, we then look at the operational parameters and their impacts on SEU performance. For example, we know that a bit flip may occur in an SRAM, but the circuit level effects are dependent on what the SRAM is being used for in this application. Sample propagated effects might include:

- an SEU in an SRAM being used for data storage
---> a bad data point,
- an SEU in an SRAM holding software program instructions
---> improper processor operation or flow, or
- an SEU in an SRAM used as a shared memory buffer between two other ICs such as a processor and a direct memory access (DMA) controller
---> any of a large number of potential error conditions (program flow, bad data point, etc).

One must again be aware of the potential for benign and non-benign SEU effects. A sample case is as follows. Assume that a bus driver IC that is being used to drive a microprocessor address bus has an SEU-induced noise spike. Both the time at which this spike occurs and the transient's characteristics (duration and voltage level) determine whether this condition is observed by the surrounding circuitry as an error or not. Again, the concept of a sensitive time window applies. If the transient occurs on a quiescent bus (i.e., no transactions taking place), the SEU is most likely benign. If the transient occurs on an active bus, the SEU may or may not be benign, depending on the exact timing of the transaction and the noise spike, as well as the spike's duration and amplitude.
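The bus driver case can be sketched as a check of the spike against the bus's sensitive time windows. Everything here is a hypothetical simplification: the sample windows stand in for the intervals around each address latch edge, and the noise margin is an assumed value.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Spike:
    start_ns: float
    duration_ns: float
    amplitude_v: float


def overlaps(spike: Spike, window: Tuple[float, float]) -> bool:
    """True if the spike is present at any time inside the (start, end) window."""
    start, end = window
    return spike.start_ns < end and spike.start_ns + spike.duration_ns > start


def is_benign(spike: Spike, sample_windows: Iterable[Tuple[float, float]],
              noise_margin_v: float) -> bool:
    """A spike is benign if it is too small to matter or misses every window."""
    if spike.amplitude_v <= noise_margin_v:
        return True
    return not any(overlaps(spike, w) for w in sample_windows)


# Hypothetical windows (ns) during which the address bus is being sampled.
windows = [(95.0, 105.0), (195.0, 205.0)]

print(is_benign(Spike(150.0, 5.0, 2.0), windows, noise_margin_v=0.8))  # True
print(is_benign(Spike(98.0, 5.0, 2.0), windows, noise_margin_v=0.8))   # False
print(is_benign(Spike(98.0, 5.0, 0.3), windows, noise_margin_v=0.8))   # True
```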

Once the operational analysis is performed, the engineer is again able to perform a circuit simulation using digital or analog tools. The output of this analysis is a list of the potential SEUs in a circuit and their effects on circuit operation. We may view this as a "black box" wherein the internal circuitry doesn't matter, but what is observed by the outside world (subsystem, system, etc.) is noted.

5.3.3 Higher Level Analysis

We may treat subsystem, system, and spacecraft levels of analysis in a single manner. Each of these levels handles the previous level as a black box, concerned not with its internal details but only with the higher level effects. We will discuss the subsystem level herein as a representative analysis layer.

Once the circuit level analysis is complete, we begin the subsystem level analysis. In essence, we may treat the subsystem exactly like the circuit level, but look for performance aspects of the SEU-induced anomaly. An example follows.

A Command and Data Handling (CADH) subsystem may be composed of separate circuits such as those for data storage, spacecraft command processing, attitude control processing, instrument interfacing, spacecraft engineering telemetry gathering, etc. Let's say, for instance, that an SEU occurs in the spacecraft command processing circuitry. To be more specific, we know by circuit analysis that this SEU causes the spacecraft command processing circuitry to have a false output. Again looking at operational parameters and sensitive time windows and amplitudes, we determine if and how this may affect the surrounding circuits and whether there is an effect on the subsystem performance and its output on the whole. For example, we determine if the false output propagates through the instrument interfacing circuit, causing an incorrect output on the instrument command interface.

The system level analysis takes this one step further. By continuing with the CADH example, we observe that this false output again may or may not propagate to another subsystem. Depending again on sensitive time windows and amplitudes, an incorrect command may or may not be issued to the instrument.

The spacecraft level of analysis then would take the output of the system level analysis and determine, in this case, whether the incorrect command would affect the overall spacecraft operation. For example, we might observe incorrect instrument data being gathered or a system safing occur.
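The black-box view of the CADH example can be sketched as a chain of levels, each mapping an incoming effect to the effect it passes upward, or to nothing if the effect is masked. The helper names and the effect strings are hypothetical; in a real analysis each mapping would be the output of the previous level's study.

```python
from typing import Callable, Dict, List, Optional, Tuple

Effect = str
Level = Callable[[Effect], Optional[Effect]]


def make_level(mapping: Dict[Effect, Effect]) -> Level:
    """Build a black box: known input effects map to output effects, else None."""
    return lambda effect: mapping.get(effect)


def propagate(effect: Effect,
              levels: List[Tuple[str, Level]]) -> List[Tuple[str, Effect]]:
    """Carry an effect up through the levels until one of them masks it."""
    trace = []
    for name, level in levels:
        propagated = level(effect)
        if propagated is None:
            break                      # benign at this level: propagation stops
        trace.append((name, propagated))
        effect = propagated
    return trace


circuit = make_level({"SEU in command processing circuit": "false command output"})
subsystem = make_level({"false command output":
                        "incorrect output on instrument command interface"})
system = make_level({"incorrect output on instrument command interface":
                     "incorrect command issued to instrument"})
spacecraft = make_level({"incorrect command issued to instrument":
                         "incorrect instrument data gathered"})
masked_subsystem = make_level({})      # false output misses the sensitive window

full = propagate("SEU in command processing circuit",
                 [("circuit", circuit), ("subsystem", subsystem),
                  ("system", system), ("spacecraft", spacecraft)])
stopped = propagate("SEU in command processing circuit",
                    [("circuit", circuit), ("subsystem", masked_subsystem),
                     ("system", system), ("spacecraft", spacecraft)])

print(full[-1])      # ('spacecraft', 'incorrect instrument data gathered')
print(len(stopped))  # 1
```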

5.4 Example

To provide a little more detailed understanding, we shall discuss a typical ADC. This (hypothetical) ADC has both digital and analog sections. Let's assume an SEU occurs in a calibration RAM area of the device. We shall look at how this SEU could propagate to affect spacecraft performance.

At the device level, we observe a shift of the output levels by +1V. That is, each sample gathered is incorrect with a constant offset of +1V.

At the circuit level, we observe that the engineering telemetry circuit output for a temperature/thermistor circuit for the CADH subsystem has the same +1V offset.

At the subsystem level, the CADH subsystem observes that the CADH temperature is 10 degrees higher than before.

At the system level, no direct effect is propagated to another subsystem, but we still observe the abnormally high temperature for the CADH subsystem.

At the spacecraft level, we observe that the CADH subsystem is operating at a temperature above its specified limit and take an action such as entering a safing mode, turning off a heater, or sending an anomaly report to the ground via downlink and then awaiting ground intervention to correct the anomaly.
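The numbers in this example can be traced with a short sketch. The 10 degrees per volt scale is implied by the example (+1V reads as +10 degrees); the linear conversion, the 45 degree limit, the 4.0 V true reading, and all function names are hypothetical.

```python
DEG_PER_VOLT = 10.0     # implied by the example: +1 V reads as +10 degrees
TEMP_LIMIT_DEG = 45.0   # hypothetical specified limit for the CADH subsystem


def adc_output(true_v: float, offset_v: float = 0.0) -> float:
    """Device level: a corrupted calibration RAM adds a constant offset."""
    return true_v + offset_v


def telemetry_temperature(adc_v: float) -> float:
    """Circuit and subsystem level: assumed linear thermistor conversion."""
    return DEG_PER_VOLT * adc_v


def spacecraft_action(temp_deg: float) -> str:
    """Spacecraft level: respond if the reported temperature exceeds the limit."""
    return "enter safing mode" if temp_deg > TEMP_LIMIT_DEG else "nominal"


nominal = telemetry_temperature(adc_output(4.0))
upset = telemetry_temperature(adc_output(4.0, offset_v=1.0))

print(nominal, spacecraft_action(nominal))  # 40.0 nominal
print(upset, spacecraft_action(upset))      # 50.0 enter safing mode
```

The actual temperature never changes in this sketch; only the reported value does, yet the spacecraft still responds.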

5.5 Summary

We have presented a methodology for viewing the propagation of SEUs from the device level to the spacecraft level of integration. Understanding the effect a single bit flip or transient has on the spacecraft is a key to reducing risk in spacecraft programs.

5.6 References

1. D.M. Newberry, D.H. Kaye, G.A. Soli, "Single Event Induced Transients in I/O Devices: A Characterization", IEEE Trans. Nucl. Sci., vol 37, pp 1974-1980, Dec 1990.

2. D.M. Newberry, "Single Event Upset Error Propagation Between Interconnected VLSI Logic Devices", RADECS 91: IEEE Proceedings from, vol 15, pp 471-474, Sep 1991.

3. D.M. Newberry, "Investigation of Single Event Effects at the System Level", RADECS 93: IEEE Proceedings from, pp 113-120, Sep 1993.

4. J.F. Leavy, L.F. Hoffman, R.W. Shovan, M.T. Johnson, "Upset Due to a Single Particle Caused Propagated Transient in a Bulk CMOS Microprocessor", IEEE Trans. Nucl. Sci., vol 38, pp 1493-1499, Dec 1991.

5. K.A. LaBel, E.G. Stassinopoulos, G.J. Brucker, "Transient SEUs in a Fiber Optic System for Space Applications", IEEE Trans. Nucl. Sci., vol 38, pp 1546-1550, Dec 1991.

6. K.A. LaBel, P.W. Marshall, C.J. Dale, C.M. Crabtree, E.G. Stassinopoulos, M.M. Gates, "SEDS MIL-STD-1773 Fiber Optic Data Bus: Proton Irradiation Test Results and Spaceflight SEU Data", IEEE Trans. Nucl. Sci., vol 40, pp 1638-1644, Dec 1993.

7. R. Koga, S.D. Pinkerton, S.C. Moss, D.C. Mayer, S. LaLumondiere, S.J. Hansel, K.B. Crawford, W.R. Crain, "Observation of Single Event Upsets in Analog Microcircuits", IEEE Trans. Nucl. Sci., vol 40, pp 1838-1844, Dec 1993.

8. R. Ecoffet, S. Duzellier, P. Tastet, C. Aicardi, M. Labrunee, "Observation of Heavy Ion Induced transients in Linear Circuits", Workshop Record for the 1994 IEEE Radiation Effects Data Workshop, pp 72-77, 1994.

9. K.A. LaBel, A.K. Moran, D.K. Hawkins, J.A. Cooley, C.M. Seidleck, M.M. Gates, B.S. Smith, E.G. Stassinopoulos, P.W. Marshall, C.J. Dale, "Single Event Effect Proton and Heavy Ion Test Results for Candidate Spacecraft Electronics", Workshop Record for the 1994 IEEE Radiation Effects Data Workshop, pp 64-71, 1994.

10. T.L. Turflinger, M.V. Davey, "Transient Radiation Test Techniques for High-Speed Analog-to-Digital Converters", IEEE Trans. Nucl. Sci., vol 36, pp 2356-2361, Dec 1989.

11. T.L. Turflinger, M.V. Davey, "Single Event Effects in Analog-to-Digital Converters: Device Performance and System Impact", IEEE Trans. Nucl. Sci., vol 41, pp 2187-2194, Dec 1994.

12. K.P. McCarty, J.R. Coss, D.K. Nichols, G.M. Swift, K.A. LaBel, "Single Event Effects Testing of the Crystal CS5327 16-Bit ADC", Workshop Record for the 1994 IEEE Radiation Effects Data Workshop, pp 86-96, 1994.

13. K.A. LaBel, E.G. Stassinopoulos, G.J. Brucker, C.A. Stauffer, "SEU Tests of a 80386 Based Flight-Computer/Data-Handling System and Discrete PROM and EEPROM Devices, and SEL Tests of Discrete 80386, 80387, PROM, EEPROM and ASICS", Workshop Record for the 1992 IEEE Radiation Effects Data Workshop, pp 1-11, 1992.