Thursday, March 21, 2013

The Reluctant Troubleshooter

This post strays from Comrex-specific stuff into general engineering philosophy for a bit. My son Andrew, who enters high school next year, was offered a course called Introduction to Engineering as an elective (kudos to whoever instituted this option, btw). It got me thinking about how my engineering skills developed over time, what my education was lacking, and what I had to learn "the hard way".

One of those major skills was troubleshooting philosophy and discipline. When trying to tackle a tough engineering problem, there are several pitfalls I caught myself falling into in the early days. Luckily I had a good mentor, and between consultation and perseverance was able to "mold" my approach to tough problems in a way that serves me well. But a difficult problem (particularly one with a tight timeline) can be a stressful thing, and there's a certain degree of self-discipline involved in stepping back and evaluating whether you've followed your own rules.

All technical issues, as random as they seem, follow the rules of physics and logic on some level. But even the best engineers sometimes don't have the time, resources or experience to dive deeply enough to solve a problem using the proper engineering philosophy. We all have our limits. On some level, the quality of an engineer can be defined by the percentage of problems solved using sane logical reasoning vs. the percentage solved by "hack it 'til it works".

It's not always easy staying on the path. I've had to tackle a couple of tough ones lately that manifest themselves in what I call the "paradox of doom" (POD). Typically, this involves trying to fix an intermittent hardware failure. It's universally easier to fix something that is binary in its failure mode: it either works or it doesn't. For the most part, when you fix a software bug it's fixed and done.

But the POD shows itself in hardware much too often. A prime example-- a system presents a failure mode (to put it in Comrex terms, an example would be a 4G modem not being recognized by an ACCESS codec). So you use your logic to narrow things down to the USB interface hardware, and make a list of the usual suspects like EMI, power supply noise, routing, stray capacitance, etc. You figure out a way to simulate an improvement in each of these factors on the bench.

Now, Murphy's law dictates that the last item on your list will resolve the problem, and invariably that's what happens (I'm not by nature a superstitious type, but I must admit to a habit of tackling debug task lists backwards in an effort to foil ol' Murphy). But that's not the source of the POD.

Lo and behold, the POD rears its head when you reverse the change. I swear, this is true-- that damn circuit will continue to work just fine n' dandy 95% of the time once your fix is removed.

The Internet has come up with a meme to express exactly this emotion.


Even the most grounded and logical person is not immune to the despair that comes with spending several hours or days on a toughie, having the thrill of finding something that appears to fix things, and having that hope come crashing to earth under the POD. Depending on the depth of brainpower and time involved, mitigation techniques can consist of a long walk outdoors, a long hot shower, and eventually working up to a bottle of gin.

I always get annoyed by punditry like this that defines the problem without any solution. So here's my short list of factors that have made a difference to me in the past. Maybe this is useful wisdom to pass down to the next generation, or maybe it's all hot air. You decide.

1) Step away-- It might not seem so when you're in the grip of the POD, but to paraphrase that otherwise annoying song, things will look better in the morning. Get busy with something else (like updating your blog), then go home and get some rest.

2) Consult with others-- A half hour with respected colleagues and a white board talking things through will really change your perspective quickly. Even if they can't offer much, simply the act of talking it out and writing it down can reveal hidden truths.

3) Get back to basics-- Go back to the list of possible "root causes" of the problem, and evaluate whether you have the tools to properly run each one down. Sometimes, an equipment purchase or rental not previously considered will start to look pretty good.

4) Swallow your pride-- Engineering is such a vast science that it's not reasonable to expect anyone to know it all. And it's often not possible to learn (or re-learn) all the factors of something like EMC or transmission lines within the scope of the project. So keep a list of competent consultants to call when you've reached the limit of your experience. Industry contacts, colleagues and social networks like LinkedIn can help.
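
A footnote on the POD itself: part of the sting is that a handful of clean tests proves very little when the unfixed circuit already works most of the time. Here is a rough Python sketch of that arithmetic. It assumes every test is independent and that the unfixed unit fails at a constant rate, taken here as 5% to match the "works 95% of the time" figure above. Real intermittent faults are rarely that tidy, and the function names are made up for illustration.

```python
import math


def chance_of_clean_run(failure_rate: float, trials: int) -> float:
    """Probability that an UNFIXED unit passes `trials` tests in a row.

    Assumes each test is independent and fails with the same probability
    (an idealization; expects 0 < failure_rate < 1).
    """
    return (1.0 - failure_rate) ** trials


def trials_needed(failure_rate: float, confidence: float) -> int:
    """Smallest run of consecutive passes that an unfixed unit would
    produce by pure luck with probability at most (1 - confidence).

    Expects 0 < failure_rate < 1 and 0 < confidence < 1.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    rate = 0.05  # assumed: the unfixed circuit fails about 5% of the time
    for n in (1, 10, 20):
        luck = chance_of_clean_run(rate, n)
        print(f"{n:3d} clean trials happen by luck {luck:.0%} of the time")
    print("clean trials needed for 95% confidence:", trials_needed(rate, 0.95))
```

Under those assumptions, ten clean power-ups in a row happen by luck roughly 60% of the time, and it takes 59 consecutive passes before luck alone becomes a less-than-5% explanation. That is one argument for automating the test on the bench rather than trusting a few manual tries.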

