Years ago a talented coworker admonished us to not troubleshoot networks via “spray and pray.” Caught off guard, I burst out laughing. While I enjoy a good laugh, the sad truth is it was only funny because it was so accurate. We were acting like chimpanzees with Tommy Guns, and we weren’t getting the job done.
There are innumerable network troubleshooting guides floating about the ether, but once in a while it helps to look at what not to do in order to clarify good habits. So here are a few ways I’ve chased my tail so ineffectually that the experience wedged itself into my long-term memory.
aka haphazardly trying whatever feels right that moment
This is the same thing as “spray and pray.” (I can’t explain why geeks keep weaponizing their analogies.) What this method amounts to is focusing on random points in the logic chain and, as each focal point yields nothing, picking another point because it was shiny enough to catch your eye. An example is in order.
You’re seeing packet drops between server A and server B. A logical process might be to start on the edges and work inward. Fire up a sniffer on each side and see if B is receiving all of A’s packets (and verify the host firewall status!). Make sure B is responding.
From there you can move up the path hop by hop and check firewall rule hits, packet counters on an interface, or anything else that shows where the packets are dropping. I’d probably go about it slightly differently, but there’s a methodology here that will eventually rule out each potential root cause until you find the real one.
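To make the hop-by-hop idea concrete, here is a minimal sketch in Python. It assumes you have already collected, by whatever means you have (sniffer captures, interface counters, firewall rule hits), how many packets of a test flow entered and left each point on the path. The `Hop` structure, the `first_lossy_point` function, and the device names are all made up for illustration. They are not part of any real tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hop:
    """One observation point on the path from A to B (hypothetical structure)."""
    name: str
    packets_in: int   # packets of the test flow seen arriving at this point
    packets_out: int  # packets of the test flow seen leaving this point


def first_lossy_point(path: list[Hop], tolerance: int = 0) -> Optional[str]:
    """Walk the path in order and report the first place packets go missing.

    Loss can show up inside a hop (in > out) or on the link between two
    hops (previous out > next in). Returns None if every counter lines up.
    """
    previous: Optional[Hop] = None
    for hop in path:
        # Compare what the previous point sent with what this one received.
        if previous is not None and previous.packets_out - hop.packets_in > tolerance:
            return f"link {previous.name} -> {hop.name}"
        # Compare what this point received with what it forwarded.
        if hop.packets_in - hop.packets_out > tolerance:
            return f"inside {hop.name}"
        previous = hop
    return None


if __name__ == "__main__":
    # Invented counters: the firewall receives 1000 packets but forwards 940.
    path = [
        Hop("server-A", 1000, 1000),
        Hop("access-switch", 1000, 1000),
        Hop("firewall", 1000, 940),
        Hop("server-B", 940, 940),
    ]
    print(first_lossy_point(path))
```

The point is the ordering, not the code: each check either clears a segment or implicates it, so you never revisit a spot you have already ruled out.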
“Spray and pray” looks more like this:
- Sniff the interfaces on both sides. B isn’t receiving all the packets A is sending.
- OK, problem confirmed. Check B’s CPU.
- B’s CPU is fine. I remember this happening once before and it was dirty fiber. Let’s clean the local fiber patch.
- That did nothing and now the Director of Software Engineering is wondering why his teams lost their applications for three minutes. Whoops! Pressure is mounting.
- Maybe it’s a firewall rule in the middle dropping packets with some option or value the application is using—let’s look there!
Just writing this second method fatigued me. It usually results organically from a group of geeks with no one willing or able to take charge and enforce at least some type of discipline.
Kick the can troubleshooting
aka pushing the problem on to someone else
Every network administrator I know has grown adept at demonstrating “it’s not the network!”
Simply put, we’re an easy target. A developer sees degraded service on an application that worked fine yesterday and—boop!—opens a ticket for the network team. Meanwhile the application is hosted on a virtual machine (VM) with the same uplink as 12 additional healthy VMs and no one changed anything.
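The shared-uplink observation can be turned into a quick sanity check. The sketch below is Python and assumes you can pull a packet loss percentage per VM from your monitoring system. The function name, the threshold, and the VM names are hypothetical. A clean result does not prove the network is innocent, it only makes the shared uplink an unlikely culprit.

```python
from statistics import median


def uplink_looks_suspect(loss_pct_by_vm: dict[str, float],
                         affected_vm: str,
                         threshold_pct: float = 1.0) -> bool:
    """Return True if the peers sharing the uplink are also degraded.

    If only the affected VM shows loss while its peers are clean,
    the shared uplink is an unlikely cause.
    """
    peers = [loss for vm, loss in loss_pct_by_vm.items() if vm != affected_vm]
    if not peers:
        # Nothing to compare against, so the uplink cannot be ruled out.
        return True
    return median(peers) > threshold_pct


if __name__ == "__main__":
    # Invented numbers: one unhappy VM and 12 healthy peers on the same uplink.
    loss = {f"vm-{i:02d}": 0.0 for i in range(1, 13)}
    loss["vm-app"] = 4.2
    print(uplink_looks_suspect(loss, "vm-app"))
```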
Some tickets or phone calls we get are borderline silly, but there are a few advantages to staying engaged instead of curtly responding that it’s probably a server or application problem.
- You look less arrogant. A soft virtue, yes, but making people hate you is widely considered a bad career move.
- You meet and work with players on other teams. You learn how their applications work, how the servers are set up, and who to contact when you need something. Be realistic—you’re probably willing to quietly solve your buddy’s problem before the one that came in from a stranger. Here’s your chance to become that buddy.
- The network team can often help provide information and solve problems faster, even if it’s just by ruling out a bunch of theories via information you can quickly access.
- You might be wrong! If you carefully tell someone it looks like a server issue and it turns out to be your mistake, you can sheepishly admit it and hopefully smooth it over. If you announce your finding with 100% certainty and end up wrong, whoever you blamed may take the opportunity to publicly knock you off that horse.
In short, leave the ticket open a little while if you can. Even better, offer to join a call even if you know it’s not your problem.
aka troubleshooting in 5-minute chunks
If you’ve ever had a low-priority problem nag at you for six months, you know what this is about. Someone wants to see if you can reduce latency across the country by spreading various traffic types over different providers, but it’s not really time sensitive. So you spend 15 minutes gathering statistics before something else forces you away. It takes two weeks to come back because it’s not time sensitive, right?
The problems with this are legion. First, you forget almost all of what you did and why, because you’ve dealt with 30 problems since then. So you go to your notes, which probably aren’t strong enough to compensate for your understandable amnesia.
So you start over. It’s quicker this time because you remember some of what you did and have your notes, so instead of 15 minutes it takes 10, most of which is spent logging into things and verifying nothing changed. You make another 10 minutes of solid progress before someone wants to go on a group coffee run.
This is a horrible way to accomplish anything—reading a book, getting in shape, learning to dance. I’ve had things on my to-do list for a year sometimes because of this. There’s no substitute for sustained, focused effort.
There are plenty of bad ways to troubleshoot. The overarching rules of thumb I’ve learned from experience are to stay actively engaged, work patiently and methodically, and never leave things half finished.