Tutorial 4705

The Art of Hot-Swapping in Telecom Systems: Avoid a Patchwork and Implement a More Effective Solution

Author: Hamed Sanogo

Abstract: This application note discusses the important role of hot-swapping in high-availability systems and the optimal circuits for implementing it. It uses a telecom system as an example of an ensemble of embedded microprocessor-based cards plugged into a mid- or backplane. Classified as "high-availability" systems, these systems are not supposed to be powered down for service or repair. The article defines the term "5-NINEs availability," which translates to almost zero downtime. This level of availability can only be achieved when the cards are serviced by hot-swapping them in and out without powering down the entire system. The article then describes hot-swap circuits. It shows some patchwork approaches to hot-swapping and explains the deficiencies of these methods. The article concludes with a discussion of newer, more integrated hot-swap controllers that overcome the problems with prior designs.

A similar article appeared July 19, 2010, on the Planet Analog website.

Introduction

Like many other complex multicard systems, a telecom system is an ensemble of embedded microprocessor-based cards plugged into a mid- or backplane. Classified as "high availability" systems, they include private branch exchange (PBX), cellular base transceiver station (BTS), blade center Telco (BCT) servers, network data communication and storage systems. Once up and running, these systems are not supposed to be powered down for service or repair.

The term "5-NINEs availability" is often used to describe these systems. It means 99.999% availability, which translates to almost zero downtime (roughly five minutes per year). This level of availability can only be achieved when the cards are serviced by hot-swapping them in and out without powering down the entire system. One must be able to repair, upgrade, configure, and sometimes even expand the system on the fly without disturbing the rest of the system.
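
To put the availability figure in perspective, the short Python sketch below converts an availability percentage into allowed downtime per year. The function name and the use of an average 365.25-day year are illustrative assumptions, not part of the original article.

```python
# Average year length in minutes (365.25 days accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime permitted by an availability figure."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for availability in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(availability)
        print(f"{availability}% availability -> {minutes:.2f} minutes/year")
```

At five nines the budget is only a few minutes per year, which leaves no room for powering down a chassis to replace a card.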

This article discusses some of the patchwork solutions that board-level design engineers currently use when designing hot-swap circuits. A discussion of some innovative, new-generation hot-swapping solutions follows. The term "hot-swapping" is defined with emphasis on voltage transients. Solutions that circumvent the detrimental effects of poorly done (patchwork) hot-swapping are shown. The discussion closes with the most recent innovations in hot-swapping technology.

Hot-Swapping Events: Understanding the Transient State

Figure 1. A multi-PCB chassis-based system.

The Hot-Swap Event: Peak Inrush Surge Current at Card Insertion or Removal

Hot-swap refers to insertion or removal of cards, cables, and other items into or from a fully operational live system without first disconnecting power from the system. Properly done hot-swapping of the cards should not cause any perturbations of the power supply or the system's input and output signals.

When a fully operational chassis-based system has all its plug-in cards in the chassis (Figure 1), these cards are all powered. This means that each card has all its bulk and bypass capacitors fully charged. The bulk capacitor at the input of the supply allows the power-supply designer to accomplish two important tasks: provide good power quality to the downstream regulators on the card, and replenish the smaller distributed bypass capacitors which supply the transient demand of the load.

When another card, taken uncharged from the shelf, is plugged into the live backplane, several things can happen (see Figure 2). The uncharged bypass and bulk capacitors of the newly inserted PCB momentarily look like a short circuit and begin to charge. Some of this charge comes from capacitors C1, C2, and C3 in the live system, so the already charged capacitors on the other cards partially discharge. This uncontrolled transfer of charge from the previously inserted cards creates a large inrush current into the bulk capacitors of the new card. Depending on the system, the inrush current can reach hundreds of amperes for a very short time.

As the capacitors charge, they appear as a short and instantaneously draw a large current. Figure 3 plots the inrush current into a bulk electrolytic capacitor and the voltage across the capacitor as it charges. In the plot, the peak current reaches 9.44A. Such a large demand can discharge the capacitors elsewhere in the chassis. The resulting voltage drop can reset adjacent cards, introduce errors in the transmitted data, or cause other system glitches.

The magnitude of the instantaneous surge current is a function of the load (early power) capacitance. The larger the load's capacitors (and the lower their ESLs and ESRs), the higher the peak inrush currents.
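
The dependence on capacitance, ESR, and ESL can be illustrated with a simplified series RLC model of the insertion loop. This is a rough sketch, not a circuit from the article: for a voltage step into an uncharged capacitor, the peak current cannot exceed V/R (resistance-limited) or V·sqrt(C/L) (inductance-limited), so the smaller of the two is an upper bound. The component values and the function name are hypothetical.

```python
import math


def peak_inrush_upper_bound(v_bus, c_load, r_loop, l_loop):
    """Upper bound on peak inrush current (A) for a series RLC step response.

    v_bus  : backplane voltage in volts
    c_load : uncharged load capacitance in farads
    r_loop : total loop resistance (ESR, pins, traces) in ohms
    l_loop : total loop inductance (ESL, pins, traces) in henries
    """
    resistance_limit = v_bus / r_loop                     # i <= V / R
    inductance_limit = v_bus * math.sqrt(c_load / l_loop)  # i <= V * sqrt(C / L)
    return min(resistance_limit, inductance_limit)


if __name__ == "__main__":
    # Assumed example: 12 V bus, 1000 uF bulk capacitance, 50 mOhm, 100 nH.
    bound = peak_inrush_upper_bound(12.0, 1000e-6, 0.05, 100e-9)
    print(f"Peak inrush is at most about {bound:.0f} A")
```

In this model, a larger capacitor raises the inductance-limited bound and a lower ESR raises the resistance-limited bound, which agrees with the trend described above.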

Figure 2. Sequence of board insertion and inrush current at power-up.

Figure 3. Plot shows the inrush current into a bulk electrolytic capacitor and the voltage across the capacitor as it charges up.

Impact of Voltage Transients on the Systems Can Be Catastrophic

As is the case for any system, the power supplies in these chassis-based systems are current limited. The voltage transients that occur during a hot-swap event can have a large impact on the cards already plugged into the chassis. The inrush phenomenon can result in a significant collapse of the chassis supply rail, a voltage drop of the backplane power bus, and/or power-supply glitches that could inadvertently generate system resets. This unrestricted current surge can also cause physical damage to the components: destruction of the card's bypass and bulk capacitors, printed circuit board (PCB) traces, backplane connectors' pins, and/or the blowing of fuses (which can be a major nuisance).

Quite often there is a drop in the backplane's power bus, which causes power perturbations or supply glitches on the cards plugged into the system. The adjacent cards can experience unwanted resets, or the communication signaling between cards on the backplane can be affected (e.g., a bit error is induced). Backplanes generally use differential buses (LVDS, LVPECL, Fibre Channel, and others), which must meet certain signaling specifications to ensure proper performance. A hot-swapping event can affect their common-mode noise specifications by introducing voltage variations on the VCC and ground planes. Given these potential deleterious effects, a well-implemented hot-swap circuit must ensure that the hot-swap event does not generate enough noise on the backplane to cause errors in the data carried on these buses.

Another problem most often ignored by designers is the long-term reliability of the system. Poorly designed hot-swap-protected circuits cause components to be slowly stressed by each induced hot-plug event. Essentially, every time a hot-swap event takes place, its effects are akin to an attempt to "pull" (in order to detach) the bond wire from the silicon in the package. This repeated stress will cause catastrophic failure over time. The best remedy for this phenomenon is to control peak current and inrush current on hot-swap cards.

Patchwork Inrush Control Implementations

There are several known ways of implementing a control solution for inrush peak current. Some methods are based on sound engineering analysis, while others are a poorly designed means to mitigate the effects of hot-swapping in systems. These latter approaches are described below as patchwork implementations.

Precharger Pin or "Early Power" (i.e., Resistor Approach)

One way to achieve inrush current control in an application is to use the "staggered-pins" approach, also known as "early-power pins," "precharge voltage," or "preleading" pins. The staggered-pins implementation provides a physical means to ensure that the new card is seated correctly and that the connections are made in a timely fashion. This inrush current-control implementation can also be used in combination with a resistor to limit current during the hot-swapping event.

The precharge pin solution, one of the most basic hot-swap solutions, implements a connector with a combination of long and short power pins. Refer to Figure 4. The long power pin mates first and starts charging the new card's filter and bulk capacitors through a series resistor, RPRECHARGE. RPRECHARGE limits the current drawn. Near the end of the card-seating process, the short power pin mates, bypassing RPRECHARGE connected to the longer pin and creating a low-impedance path for powering the card. The signal pins usually mate last to complete the seating process.

Figure 4. Smart connectors enable hot-plug capabilities.

The protection device in this case is a resistor, RPRECHARGE, which protects the card by limiting the inrush current to a level that will not damage the pins or disturb the voltage rails on the adjacent cards. Some engineers add an inductor and/or a diode to ground to this basic implementation.

This article considers the precharge pin approach as a "patchwork" hot-swap solution because the bulk input filter capacitors' charge rate is still impossible to control. There are two main issues with this scheme: the variations in the length of the short pins relative to the long pins, and the fast versus slow insertion time of the card into the system by a service technician. Ultimately, this is a mechanical solution; pins of the same nominal length may not necessarily make contact at exactly the same time due to the mechanical tolerances of the connectors. This is why users can experience the variations mentioned above. Moreover, if the short power pin is a bit longer, and if there is a very fast insertion time for the PCB into the chassis, then RPRECHARGE can be shorted out before the bulk input capacitor has the chance to fully charge. This scenario is quite plausible and, thus, partially negates the attempt to control inrush current.

Another important step is sizing RPRECHARGE. This is not an easy task and can impact the system if the resistor is not properly specified. The value of this precharge resistor must be sized to equalize both the precharge and the main inrush currents.
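
The sizing trade-off can be sketched with a first-order RC model. A small resistor gives a large surge when the long pin mates; a large resistor leaves the bulk capacitor undercharged, so a second surge occurs when the short pin bypasses it. The Python sketch below evaluates both surges and picks the candidate resistor whose larger surge is smallest. All values, including the pin-to-pin mating delay and the residual main-path resistance, are hypothetical, and the model ignores inductance and load current.

```python
import math


def surge_currents(v_bus, c_bulk, r_precharge, r_main, t_mate):
    """Return (precharge_peak, main_peak) in amperes.

    t_mate is the assumed delay in seconds between the long and short pins
    making contact; r_main is the assumed resistance of the main power path.
    """
    precharge_peak = v_bus / r_precharge
    # Voltage still missing from the capacitor when the short pin mates.
    remaining_v = v_bus * math.exp(-t_mate / (r_precharge * c_bulk))
    main_peak = remaining_v / r_main
    return precharge_peak, main_peak


def pick_precharge_resistor(candidates, v_bus, c_bulk, r_main, t_mate):
    """Choose the candidate resistance whose worst-case surge is smallest."""
    return min(
        candidates,
        key=lambda r: max(surge_currents(v_bus, c_bulk, r, r_main, t_mate)),
    )


if __name__ == "__main__":
    candidates = [1.0, 1.5, 1.8, 2.0, 5.0]  # ohms, assumed available values
    best = pick_precharge_resistor(candidates, 12.0, 470e-6, 0.05, 3e-3)
    print(f"Best candidate: {best} ohms")
```

Because the result depends strongly on t_mate, which varies with pin tolerances and insertion speed, the chosen value is only as good as that assumption. This is the weakness of the approach described above.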

Lastly, the staggered-pin implementation requires a specialized connector, which has historically been cost prohibitive.

As the above arguments show, the shortcomings of the precharge pin scheme are significant. It is very limited and difficult to implement with a reliable level of precision. It does nothing to regulate current at startup, nor does it provide output overvoltage (OV) and undervoltage (UV) monitoring.

Thermistor (Current-Time Characteristic) Approach

Another hot-swap implementation scheme is the thermistor hot-swap approach. A thermistor is an electronic component that shows a significant change in resistance with a change in its temperature (i.e., a change in resistance as a function of heat). It is commonly used in circuits where temperature-dependent regulation is needed. The current-time characteristic for a negative-temperature-coefficient (NTC) thermistor depends on its heat capacity, its dissipation constant, and the circuit in which it is used. This current-time characteristic can be used to discriminate against high-voltage spikes of short duration and against initial current surges. Figure 5 shows a thermistor-based hot-swap current-limiting circuit with an external MOSFET.¹

Figure 5. Thermistor-based hot-swap circuit implementation.¹

When employing the thermistor-based approach, proper consideration must be given to the peak instantaneous power applied to the thermistor. The designer must consider the circuit board's environmental temperature variations (copper area and airflow), and the fact that the thermistor device itself can be damaged if its voltage and/or current ratings are exceeded.

There are several serious concerns with the thermistor approach. In the telecom industry, for example, a card is not expected to be redesigned after the initial release of the system to telecom carriers. Consequently, a thermistor can cause a long-term reliability issue. One must also consider the reaction time of the negative-temperature-coefficient (NTC) thermistor. A closely related problem arises if the card is repeatedly inserted into and removed from the chassis: the thermistor may not cool enough to limit inrush current effectively by the next insertion event. Finally, the characteristics of the thermistor will most likely change over time, making the system vulnerable.
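
The cool-down concern can be illustrated with the common Beta-parameter model for an NTC thermistor, R(T) = R25 · exp(B · (1/T − 1/T25)), with temperatures in kelvin. The sketch below is only illustrative; the 10 Ω nominal resistance, the Beta value of 3500 K, and the 12 V bus are assumed values, not taken from the article or from any particular part.

```python
import math

KELVIN_OFFSET = 273.15


def ntc_resistance(temp_c, r25_ohms, beta_k):
    """Resistance of an NTC thermistor using the Beta-parameter model."""
    t_kelvin = temp_c + KELVIN_OFFSET
    t25_kelvin = 25.0 + KELVIN_OFFSET
    return r25_ohms * math.exp(beta_k * (1.0 / t_kelvin - 1.0 / t25_kelvin))


def limited_inrush(v_bus, temp_c, r25_ohms, beta_k):
    """Peak inrush current (A) if only the thermistor limits the current."""
    return v_bus / ntc_resistance(temp_c, r25_ohms, beta_k)


if __name__ == "__main__":
    for temp_c in (25.0, 55.0, 85.0):
        amps = limited_inrush(12.0, temp_c, r25_ohms=10.0, beta_k=3500.0)
        print(f"Thermistor at {temp_c:.0f} C -> peak inrush about {amps:.1f} A")
```

Under these assumptions, a thermistor that is still hot from the previous insertion presents only a fraction of its cold resistance, so the inrush limit on reinsertion is correspondingly weaker.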

In summary, while this approach works well in temperature-dependent applications (e.g., LCD bias supplies) and can limit peak inrush current, the thermistor-based hot-swap circuit does not offer the extended benefits needed for a reliable, long-term hot-swap implementation.

Discrete Hot-Swap Circuits

Yet another way to achieve inrush current control is with several discrete components. (Admittedly, many designers might not consider this a patchwork solution.) Usually, the fault protection, circuit breaker, and current control functions are all done in separate circuitries with separate power MOSFETs, power-sense resistors, and other discrete biasing components. These discrete hot-swap circuits can not only be complex and hard to debug (this alone increases design and validation time), but can also have higher costs and require more PCB real estate.

The most important issue with discrete hot-swap circuits is the effect of the parasitic elements of the passive components. This is a critical consideration for the designer. These circuits use resistors and capacitors to control rise and fall times, voltages and currents, and other sensing conditions. The designer has the nontrivial task of determining how the parasitic elements affect the operating conditions of the circuit.

Having assessed these three patched-together approaches, one finds that there is still a better way. In fact, the best way to assure the design's long-term protection and reliability is to use a complete, integrated hot-swap solution on a single monolithic die. The next section discusses some of the industry's most innovative hot-swap solutions, including the MAX5961 hot-swap controller.

True Inrush Peak Current Control

Higher Levels of Integration

An engineer can improve the long-term reliability of a hot-swappable embedded card by using a circuit that limits the inrush current to the inserted card, protects against overcurrent conditions and load transients, and keeps the number of failure points low. There are hot-swap controller ICs on the market with higher levels of integration; some no longer need a sense resistor. Many other ICs have made implementing a hot-swap circuit an easy and very effective task. One can, for example, find the following functions supported in a single part: UV and OV protection; active current limiting with a constant-current source during overload; an electronic circuit breaker that disconnects faulty loads before the supply drops out; reverse-current protection with an additional drive FET to provide an "ideal diode"; multiple-voltage sequencing; digital voltage and current monitoring; and automatic retry after a load fault.

A few analog semiconductor suppliers have introduced a wide variety of hot-swap solutions to meet a large number of system requirements. The newest generation of hot-swap ICs offers many analog and digital features, such as the ability to continually monitor the supply current long after the card has been seated and powered up. This monitoring ensures that the card is continuously protected against short-circuit and overcurrent conditions during normal operation. Continuous monitoring also allows malfunctioning cards to be identified and removed from the system quickly, before they fail completely and precipitate downtime.

The Importance of an Integrated ADC

Maxim, Analog Devices, and Linear Technology® offer hot-swap solutions with digital fault reporting and statistical data (or flight) recording features. A recent term, "digital hot-swap" IC, refers to hot-swap solutions that integrate a high-performance ADC for voltage and current monitoring. Table 1 compares some of the key specifications for hot-swap ICs from these suppliers. The MAX5967 is not in the table, but is pin- and function-compatible with the LTC4215.

Table 1. Digital Hot-Swap IC Comparison
| Parameter | LTC4215 | ADM1175 | MAX5961 | MAX5970 |
| --- | --- | --- | --- | --- |
| ADC Resolution (bits) | 8 | 12 | 10 | 10 |
| Conversion Rate (Hz) | 10 | Not Specified | 10k | 10k |
| Automatic or Polled? | Auto | Polled | Auto | Auto |
| History "Depth" | 1 sample | 1 sample | 50 samples | 50 samples |
| INL | 0.2 LSB, 0.5 LSB | Not Specified | 0.5 LSB | 0.5 LSB |
| Full-Scale Error (voltage, current) | ±5.5 LSB, ±5.0 LSB | ±60.0 LSB, ±100.0 LSB | ±10 LSB, ±30.0 LSB | ±10 LSB, ±30.0 LSB |
| Interface | I²C/SMBus™ | I²C | I²C/SMBus | I²C/SMBus |
| High-Speed Voltage (min, max) | 2.9V, 15V | 3.15V, 13.2V | 0V, 16V | 0V, 16V |
| GATE Pullup Current (µA) | 20 | 12 | 5 | 5 |
| GATE Pulldown Current, Normal (mA) | 1 | 2 | 500 | 500 |
| Slow-Trip Circuit-Breaker Threshold (mV) | 25 | 85 | 12.5, 25, 50 (and 8-bit programmable) | 12.5, 25, 50 (and 8-bit programmable) |
| Fast-Trip Circuit-Breaker Threshold | | 115mV | 125%, 150%, 175%, 200% of programmed slow trip | 125%, 150%, 175%, 200% of programmed slow trip |
| Load UV Protection | Analog | | 2 each, 10-bit programmable | 2 each, 10-bit programmable |
| Load OV Protection | | | 2 each, 10-bit programmable | 2 each, 10-bit programmable |
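
The circuit-breaker thresholds in Table 1 are voltages, typically measured across a current-sense resistor, so the actual trip current depends on the resistor chosen by the designer. The sketch below converts a slow-trip threshold into trip currents, using the fast-trip percentages listed for the MAX5961 and MAX5970. The 5 mΩ sense resistor and the function names are assumptions for illustration only.

```python
# Fast-trip options from Table 1, as a percentage of the slow-trip threshold.
FAST_TRIP_PERCENTAGES = (125, 150, 175, 200)


def trip_current_amps(threshold_mv, r_sense_mohm):
    """Trip current in amperes: millivolts divided by milliohms."""
    return threshold_mv / r_sense_mohm


def fast_trip_currents(slow_threshold_mv, r_sense_mohm):
    """Map each fast-trip percentage to its trip current in amperes."""
    slow = trip_current_amps(slow_threshold_mv, r_sense_mohm)
    return {pct: slow * pct / 100.0 for pct in FAST_TRIP_PERCENTAGES}


if __name__ == "__main__":
    r_sense_mohm = 5.0  # assumed sense resistor value
    for threshold_mv in (12.5, 25.0, 50.0):
        slow = trip_current_amps(threshold_mv, r_sense_mohm)
        print(f"{threshold_mv} mV slow trip -> {slow:.2f} A")
    print(fast_trip_currents(25.0, r_sense_mohm))
```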

The embedded ADC in these devices gives the hot-swap controller IC the extended ability to monitor and report the power-supply states and other vital signs at the instant that the fault occurs. The MAX5961 also stores several milliseconds of past voltage and current measurements. This data can be used to ease system debugging and failure analysis later.
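
The "several milliseconds" figure follows from Table 1: 50 samples at a 10kHz conversion rate span 5ms. The sketch below shows that arithmetic and models the history as a fixed-depth buffer that always holds the most recent samples. The class is a software analogy with hypothetical names, not a description of the IC's internal implementation.

```python
from collections import deque


def history_span_ms(depth_samples, conversion_rate_hz):
    """Time window in milliseconds covered by a sample history."""
    return depth_samples * 1000.0 / conversion_rate_hz


class MeasurementHistory:
    """Fixed-depth record of the most recent (volts, amps) samples."""

    def __init__(self, depth=50):
        # A deque with maxlen silently discards the oldest entry when full.
        self._samples = deque(maxlen=depth)

    def record(self, volts, amps):
        self._samples.append((volts, amps))

    def snapshot(self):
        """Return the stored samples, oldest first, e.g. after a fault."""
        return list(self._samples)


if __name__ == "__main__":
    print(f"History window: {history_span_ms(50, 10_000)} ms")
```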

The integrated ADC has also created opportunities for OEMs to become more creative with their products. One can observe an increase in value-added features for advanced board management:
  • Information gathering: a designer can use a system's vital data collected today to build a next-generation system with optimized efficiencies.
  • Constant monitoring: during normal operation of these high-availability systems, it can be desirable to log certain "vital statistics" of the card through constant monitoring of its power and temperature levels. These logs can be used later to predict certain specific faults.
  • Power budget: by reading past and current fault conditions, one can ensure that no embedded card is using more than its share of the total power budget. This monitoring will facilitate early identification of abnormal operating conditions and help mitigate, or eliminate, any effects on the rest of the system.
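
The power-budget idea can be sketched in a few lines of Python. The example assumes the management firmware has already collected voltage and current samples for each card; the slot names, the sample values, and the 40W per-card budget are hypothetical.

```python
def average_power(samples):
    """Average power in watts from a sequence of (volts, amps) samples."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(volts * amps for volts, amps in samples) / len(samples)


def cards_over_budget(cards, budget_watts):
    """Return the sorted names of cards whose average power exceeds the budget."""
    return sorted(
        name
        for name, samples in cards.items()
        if average_power(samples) > budget_watts
    )


if __name__ == "__main__":
    readings = {
        "slot1": [(12.0, 2.0), (12.0, 2.2)],
        "slot2": [(12.0, 4.5), (11.9, 4.6)],
    }
    print(cards_over_budget(readings, budget_watts=40.0))
```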

The I²C Link to the System Microprocessor

The controller's I²C interface is used by the card's microprocessor to collect the vital statistics mentioned above. Through this interface the controller's fault response is configured (for example, to latch off or to restart continuously), and the system's management firmware can identify a problem card early. This interface is essentially the chassis's warning display for the service technician. It serves much like the service-engine-soon light on the dashboard of a car.
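
A management-firmware polling loop might look like the following sketch. The status-byte bit assignments and the slot addresses are hypothetical; the actual register map comes from the data sheet of the controller in use. The I²C transaction is represented by a caller-supplied read_status function so that the example stays independent of any particular bus driver.

```python
# Hypothetical status-byte layout, for illustration only.
FAULT_BITS = {
    0x01: "overcurrent",
    0x02: "undervoltage",
    0x04: "overvoltage",
}


def decode_faults(status_byte):
    """Return the names of the fault bits set in a status byte."""
    return [name for bit, name in FAULT_BITS.items() if status_byte & bit]


def find_problem_cards(read_status, slot_addresses):
    """Poll each slot and return {address: [fault names]} for faulted cards.

    read_status(address) stands in for an I2C register read and must
    return an integer status byte.
    """
    report = {}
    for address in slot_addresses:
        faults = decode_faults(read_status(address))
        if faults:
            report[address] = faults
    return report


if __name__ == "__main__":
    # Simulated bus: the card at 0x31 reports overcurrent and overvoltage.
    simulated = {0x30: 0x00, 0x31: 0x05}
    print(find_problem_cards(simulated.get, [0x30, 0x31]))
```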

Conclusions

Hot-swapping PC boards in high-availability systems is a necessity. Nonetheless, tracing a PC board malfunction caused by inrush current after an insertion event is a very challenging task. Understanding the malfunction, or preferably preventing one, is complicated by patchwork hot-swap solutions, which can degrade the system's long-term performance far more than the engineer expects.

Today's highly integrated hot-swap solutions will ensure that a hot-plug event in a system does not cause data-transmission errors or resetting of the cards already in the system. These solutions will help sustain a system's long-term reliability. In the end, the goal is all about meeting and exceeding the 5-NINEs.



References
¹Maxim application note 1785, "Flexible Hot-Swap Current Limiter Allows Thermal Protection."