IC Troubleshooting and Failure Analysis: Find the Facts and Avoid the Guesswork

Abstract: When troubleshooting a complex device, knowledge is king. We want, and need, to know everything relevant to the issue, including the proper IC revision number, where to find relevant reference materials, and who really knows what happened at the customer's site. Failure analysis of ICs requires a quick and proper response because, of course, helping a customer is our main concern. But should we expect the quality assurance (QA) department to test every parameter over all conditions during a failure analysis (FA)? No, not at all. Too much of that is guesswork. It may surprise some people, but QA people do not have crystal balls nor do they read minds. Timely and effective IC troubleshooting is only possible when precise technical information about an IC failure is available from the customer.

A similar version of this article appears on EDN, October 31, 2012.

Failure Analysis of ICs—It Can Waste Time

We often hear, "perception is reality." When an IC fails, or the customer thinks that it failed, we must respond with a failure analysis (FA). To do that effectively, we must have accurate, pertinent information about the incident. That is the only way to avoid guesswork.
Let me relate an incident that happened not so long ago. A part was returned as a failure and we knew nothing else. We ran it on the automatic test equipment (ATE), and bench tested, X-rayed, and decapped the part. We flooded it with soft electrons in an electron microscope to look for emission sites indicating damage. We measured its temperature using a liquid crystal coating. The part was perfect. We found no reason for failure, so the QA department said exactly that in the FA report. Why, we wondered, was the part returned as failed?
About two months later we learned, almost by accident, that the customer experienced this failure only when the part was heated above +60°C. We started the FA again. We tested the part at room temperature (+25°C), and we found… nothing. The part no longer functioned because it had been destroyed in the process of testing it. Ultimately, this was a one-time return event; it did not happen again. But we learned something more important from this episode: without crucial performance (i.e., failure) data we were blind and guessing. We wasted considerable time and money for nothing. (See the Appendix, "IC Failure Analysis on the Homefront," for another, more personal story of antique cars, grounding issues, and another failed IC.)

An Exhaustive Exercise in QA Futility

Many times a failed IC is so damaged that the origin of the damage cannot be determined. One customer took a board from the assembly contractor back to their lab facility. There they removed the IC from the board and claimed that the IC failed. Very likely. The customer came to a conclusion: a "root cause" in the IC itself. They wanted an FA, but where was the failure data? Were the circumstances recorded carefully? What would prevent future failures? We were back to guessing, not fact checking—hardly a prescription for a meaningful FA.
In this case the customer had concentrated on three pins of a multi-output device. Here is what we did know: the part left the fab operating, with a certainty that allows for only a few failures per billion parts; it operated in a circuit for hours before it failed. Was it an infant failure, or was it damaged by external handling? In the customer's circuit? In the application environment? Did electrostatic discharge (ESD) at the factory weaken the circuit so it failed later? Perhaps there was damage by a shipping clerk who ignored an ESD protocol? The list of possible factors seemed endless.
The first partial schematic received from the customer was not very helpful. It showed neither what drove the failed part nor what the part needed to drive. The local field applications engineer (FAE) was asked to check the ground. Were the grounds separated correctly? You could not tell from the schematic. We received a few more pieces of the schematic, but now had more questions than answers. Why did the customer check only three of the many outputs? Were any input or output pins of the device connected with low impedances to board pins? Did the power and ground count as low-impedance connections? Could ESD on the board pins be the issue? We were still guessing.

Effective Failure Analysis—Troubleshooting a Crime Scene

Now we ask, "What might be accomplished with the proper information from the outset?" Was it reasonable to expect QA to exhaustively test every parameter over all conditions, especially when we knew nothing specific about the failure? No. We can only help a customer understand why an IC failed, and we can only correct it, with adequate knowledge of the application.
This approach, admittedly, conflicts with those who believe that an FA should be performed with no delay. I've heard that "an FA is always the first thing to be done. Looking at the internal parts of the IC should be done before looking at the IC in the application circuit." I do not understand where that idea originated, and I disagree. The FA is not the first task. Rather, investigating the "scene of the crime," the failure incident, is the first step.
The information at the failure location is critical and, like police investigators, we should go to extreme lengths to preserve the on-scene data. The first thing is to investigate the IC in the application circuit, i.e., where it failed. A simple thing like a solder splash may be the key to the answer. The IC might be partially operational rather than totally failed. In fact, removing the IC could mask the real problem.
For an effective FA, we need to check a customer's schematic diagram and gather all the circumstances, the reasons, for the failure. Yes, this procedure may well confront a customer's confidentiality issues. This is a common concern, which is why there are nondisclosure agreements (NDAs). This is also a situation where FAEs serve as the factory's eyes and ears on the ground all over the world. FAEs can go into the customer's facility and evaluate the schematics, layout, and other conditions for the application. To protect customer confidentiality, the FAE need only send QA the relevant parts of a customer's design schematic. And now, finally, QA will be working with credible failure data.
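One way to make "credible failure data" concrete is a simple incident record that the FAE fills in before QA starts work. The following is only a minimal sketch: the class name `IncidentReport` and every field name are hypothetical, chosen to mirror the questions raised in this article (IC revision, the conditions under which the failure appears, time in circuit, whether the part was examined in circuit, the relevant schematic excerpt, grounding, and ESD handling). It is not an actual QA form.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentReport:
    """Hypothetical record of what QA would want to know before an FA."""
    ic_revision: Optional[str] = None            # revision marking on the part
    failure_conditions: Optional[str] = None     # e.g. "fails only above +60°C"
    hours_in_circuit: Optional[float] = None     # operating time before failure
    inspected_in_circuit: Optional[bool] = None  # examined before removal?
    schematic_excerpt: Optional[str] = None      # what drives the part, what it drives
    grounding_notes: Optional[str] = None        # common ground or separate grounds
    esd_handling_notes: Optional[str] = None     # handling history, ESD precautions

    def missing_fields(self) -> List[str]:
        """Return the names of all fields that are still unanswered."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def ready_for_fa(self) -> bool:
        """True only when no field is left unanswered."""
        return not self.missing_fields()


if __name__ == "__main__":
    # A return like the first story: a part arrives with no information at all.
    report = IncidentReport()
    print("Missing before FA:", ", ".join(report.missing_fields()))
```

In this sketch, a part that arrives with nothing more than "it failed" reports every field as missing, which is exactly the situation in which QA is forced to guess.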

A Successful Outcome

Back to our story. The local FAE became more closely involved with the customer on this failure issue. With more schematics in hand, here is what little we saw. An op amp connects to an output pin, but it should have little effect because of the 10kΩ series resistor. Because the board uses one common ground, not separate grounds connected at one star point, noise on one supply is directly coupled through the decoupling capacitors to other supplies. The smallest decoupling capacitors are 0.1µF. Typical surface-mount 0.1µF capacitors are self-resonant at about 15MHz; above that frequency they are inductors and cease to function as capacitors.
There are two lessons from this. First, decoupling capacitors are a two-way street. If one couples a noisy power supply to a quiet supply, the noise will contaminate the quiet supply. Second, the same thing happens with a noisy ground: the noise will contaminate the quiet supply. Noisy supplies need to be paired with a noisy ground, and a clean, quiet supply must be paired with a clean ground. Cross-contamination can hurt both powers and grounds. Above its self-resonant frequency, a capacitor becomes inductive; that is, it no longer bypasses or attenuates high-frequency energy effectively.
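The self-resonance behavior described above can be illustrated with the usual series R-L-C model of a real capacitor. This is a rough sketch under stated assumptions, not a characterization of the parts on the customer's board: the 0.1µF value and the 15MHz self-resonant frequency come from the text, the parasitic inductance (ESL) is back-calculated from those two numbers (roughly 1.1nH), and the 0.02Ω series resistance (ESR) is an assumed, purely illustrative value.

```python
import math


def esl_from_srf(c_farads: float, srf_hz: float) -> float:
    """Parasitic inductance implied by f0 = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / ((2.0 * math.pi * srf_hz) ** 2 * c_farads)


def self_resonant_frequency(c_farads: float, esl_henries: float) -> float:
    """Self-resonant frequency of a series L-C combination."""
    return 1.0 / (2.0 * math.pi * math.sqrt(esl_henries * c_farads))


def reactance(f_hz: float, c_farads: float, esl_henries: float) -> float:
    """Net series reactance: negative is capacitive, positive is inductive."""
    w = 2.0 * math.pi * f_hz
    return w * esl_henries - 1.0 / (w * c_farads)


def impedance_magnitude(f_hz: float, c_farads: float,
                        esl_henries: float, esr_ohms: float) -> float:
    """|Z| of the series R-L-C model at frequency f_hz."""
    return math.hypot(esr_ohms, reactance(f_hz, c_farads, esl_henries))


C = 0.1e-6                 # 0.1 uF, from the text
SRF = 15e6                 # about 15 MHz, from the text
ESR = 0.02                 # ohms, ASSUMED for illustration only
ESL = esl_from_srf(C, SRF) # derived from C and SRF, not a measured value

if __name__ == "__main__":
    print(f"Implied ESL: {ESL * 1e9:.2f} nH")
    for f in (1e6, 15e6, 150e6):
        x = reactance(f, C, ESL)
        if abs(x) < 1e-9:          # numerically zero at resonance
            kind = "resonant"
        elif x > 0:
            kind = "inductive"
        else:
            kind = "capacitive"
        z = impedance_magnitude(f, C, ESL, ESR)
        print(f"{f / 1e6:6.0f} MHz: |Z| = {z:.3f} ohm ({kind})")
```

In this model the impedance falls with frequency up to resonance, bottoms out near the ESR, and then rises again as the inductance takes over, so the capacitor does progressively less to bypass noise above its self-resonant frequency.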


So we come full circle and repeat an opening comment: knowledge is king when troubleshooting an IC failure. From the outset of an investigation, no one is more valuable than the local FAE who examines the issues side by side with the customer. The FAE must scrutinize the whole system, board layout, schematics, and application, and then convey that data back to QA. Only with accurate, detailed incident data can we solve IC failure issues. Without that data, QA is forced to guess about the "scene of the crime."

Appendix—IC Failure Analysis on the Homefront

Here is a related story that illustrates why knowledge is king when analyzing a failed electronic circuit. Without complete failure data, it is impossible to derive an accurate FA. This story does not start out as an IC troubleshooting issue, but quickly evolves into one.
A friend has an old Model A Ford® automobile built between 1927 and 1931. He installed a radio purchased from a local auto parts store. It failed when installed. He took the radio back to the store and they replaced it. He installed the new unit and it failed. After the third "bad" radio, the store refunded his money.
He started talking to members of an antique car club. They told him that the Model A has a positive ground, so the radio had its power-supply leads reversed. While the radio expected to connect to the positive voltage, it was actually connecting to the negative voltage. Semiconductors smoke when the power supply is reversed.
The Model A saga continued. With knowledge of the positive ground, our friend bought an expensive custom-built DC-to-DC converter to invert the power voltage. To test it, he connected the battery to the DC-DC converter and the radio on his work bench. It worked well. He then mounted everything in the car and the fuse blew. Finally, he asked this engineering friend for help.
The chassis of the Model A is connected to the positive terminal of the battery. (In today's electronics, that is equivalent to a negative power supply.) American cars after 1956 have a negative ground; the battery's negative terminal is connected to the chassis, making a positive power supply. Consumer items bought at auto stores today presume a negative ground in the car. The setup in Figure 1 worked on the bench because the radio was not bolted to the chassis of the car.
Figure 1. This setup worked on the bench because the dotted-line chassis ground was not connected on the radio.
To save cost, there is no ground isolation inside the DC-DC converter; in fact, the positive input and positive output are tied to chassis ground. So when the setup was on the bench, it worked because the dotted-line chassis ground was not connected on the radio. As soon as the radio was mounted in the car, the radio chassis shorted the power supply, thus blowing the fuse.
Suppose you are the technician at the radio company and tasked with performing an FA on these returned radios. The local parts store says only what they know: "The radios failed when installed." You open the radios to find lots of burnt parts. What caused the problems? Without more specific performance data you are guessing. As we have said, any QA engineer needs the whole failure story to be able to recommend effective corrective action.