Troubleshooting is more art than science. When diagnosing a problem, the most important tool is an intimate understanding of your network: what connects to what, and where everything is both logically and physically. You almost need to visualize the packets going from one device to the next.
That’s where network diagrams, topology mapping, and cabling spreadsheets become extremely important. If you don’t know where that errant but critical device is connected, finding the right connections by tracing cables is an exercise in frustration, perhaps futility.
In addition to keeping an up-to-date account of network topology, both physical and logical, I like to prepare for troubleshooting sessions by keeping the description fields on my switch and router interfaces up to date, allowing me to find things quickly. Do a “show interface brief” on your switches. Every active port should have a description, and the most useful and important information should be first so that it doesn’t get cut off.
Start with ping
Let’s suppose you have a device that isn’t responding. The first thing to verify is whether it’s even on the network. Ping it. This is the network equivalent of asking, “Is it plugged in? Is it turned on?” If you know the destination device by its hostname, and if DNS is working, ping will also tell you the IP address.
C:\Users\Kevin>ping www.auvik.com
Pinging www.auvik.com [126.96.36.199] with 32 bytes of data:
Reply from 188.8.131.52: bytes=32 time=36ms TTL=55
Reply from 184.108.40.206: bytes=32 time=30ms TTL=55
Reply from 220.127.116.11: bytes=32 time=29ms TTL=55
Ping statistics for 18.104.22.168:
    Packets: Sent = 3, Received = 3, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 29ms, Maximum = 36ms, Average = 31ms
Are you on the same subnet as the destination device? If you are, you can get its MAC address from your address resolution protocol (ARP) table. On Windows it looks like this:
C:\Users\Kevin>arp -a
Interface: 10.10.80.2 --- 0xd
  Internet Address      Physical Address      Type
  10.10.80.1            00-ff-3d-be-ac-6b     dynamic
  10.10.80.6            20-c9-d0-ac-22-a1     dynamic
  10.10.80.255          ff-ff-ff-ff-ff-ff     static
  224.0.0.22            01-00-5e-00-00-16     static
  224.0.0.252           01-00-5e-00-00-fc     static
  239.255.255.250       01-00-5e-7f-ff-fa     static
  255.255.255.255       ff-ff-ff-ff-ff-ff     static
C:\Users\Kevin>
Here you can see that I’m 10.10.80.2, and I know about two other devices on my segment, 10.10.80.1 (the router) and 10.10.80.6 (another computer).
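The same-subnet question can be answered mechanically if you know the mask. Here is a minimal Python sketch using the standard ipaddress module. The /24 prefix length is my assumption for illustration, since the output above doesn't show the mask, and same_subnet is just a name I made up.

```python
import ipaddress

def same_subnet(local_ip: str, remote_ip: str, prefix_len: int) -> bool:
    """Return True if remote_ip falls inside the subnet that contains local_ip."""
    # strict=False lets us pass a host address and get back its containing network.
    network = ipaddress.ip_network(f"{local_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(remote_ip) in network

# Addresses from the arp -a example; the /24 mask is an assumption.
print(same_subnet("10.10.80.2", "10.10.80.6", 24))
print(same_subnet("10.10.80.2", "10.10.81.6", 24))
```

If the answer is no, the destination won't show up in your own ARP table, and you'll need to look in the ARP table of the router on the destination's subnet instead.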
On a Cisco device, get the ARP entry using the “show ip arp” command. Actually, I’d generally use “show ip arp | include <-address->” to pull out just the entry I’m after:
Router1#show ip arp | include 10.10.80.6
Internet  10.10.80.6              8   20c9.d0ac.22a1  ARPA   Ethernet0
Router1#
Most ARP entries are learned dynamically. If the device has never worked, the Cisco device shows an “Incomplete” entry. On Windows it generally just won’t appear in the list. If there’s no entry at all, ping the address first. That will force ARP to attempt to discover it.
Router1#show ip arp | i 10.10.80.65
Internet  10.10.80.65             0   Incomplete      ARPA   Ethernet0
Router1#
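If you've saved the router's output to a file or pulled it through a script, the lookup is easy to automate. The sketch below assumes the column layout shown in the two examples above (protocol, address, age, hardware address, type, interface). The function name arp_lookup is hypothetical.

```python
from typing import Optional

def arp_lookup(show_ip_arp_output: str, ip: str) -> Optional[str]:
    """Return the MAC address for ip, or None if the entry is missing or incomplete."""
    for line in show_ip_arp_output.splitlines():
        fields = line.split()
        # Assumed columns: Protocol, Address, Age, Hardware Addr, Type, Interface
        if len(fields) >= 4 and fields[0] == "Internet" and fields[1] == ip:
            mac = fields[3]
            return None if mac.lower() == "incomplete" else mac
    return None

# The two entries shown in the examples above.
sample = """\
Internet  10.10.80.6              8   20c9.d0ac.22a1  ARPA   Ethernet0
Internet  10.10.80.65             0   Incomplete      ARPA   Ethernet0
"""

print(arp_lookup(sample, "10.10.80.6"))
print(arp_lookup(sample, "10.10.80.65"))
```

Matching the whole address field matters. The “include” filter matches on substrings, so “include 10.10.80.6” would also return the line for 10.10.80.65.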
Tracking down the MAC
Once you have a MAC address, the next step is to find which switch port it’s connected to. On a switch, the command is “show mac address-table address <-address->.” Note that Cisco has changed the syntax of this command. On some switches it’s “show mac-address-table address.”
Switch1#show mac address-table address 20c9.d0ac.22a1
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 100    20c9.d0ac.22a1    DYNAMIC     Fa0/19
Total Mac Addresses for this criterion: 1
Switch1#
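Notice that Windows prints the MAC address as 20-c9-d0-ac-22-a1 while the Cisco commands want 20c9.d0ac.22a1. Here is a small Python sketch that converts between the two and pulls the port out of saved switch output. It assumes data rows have exactly four columns (VLAN, MAC, type, port) as in the example above. Both function names are mine, not part of any tool.

```python
import re
from typing import Optional

def to_cisco_mac(mac: str) -> str:
    """Convert a MAC in any common notation to Cisco's dotted form (xxxx.xxxx.xxxx)."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))

def port_for_mac(mac_table_output: str, mac: str) -> Optional[str]:
    """Return the port listed for mac in 'show mac address-table' output, if any."""
    wanted = to_cisco_mac(mac)
    for line in mac_table_output.splitlines():
        fields = line.split()
        # Assumed data row layout: <vlan> <mac> <type> <port>
        if len(fields) == 4 and fields[1].lower() == wanted:
            return fields[3]
    return None

sample = """\
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 100    20c9.d0ac.22a1    DYNAMIC     Fa0/19
"""

print(to_cisco_mac("20-c9-d0-ac-22-a1"))
print(port_for_mac(sample, "20-C9-D0-AC-22-A1"))
```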
This will tell you which port the switch last saw using that MAC address. Look at the status of that port with a “show interface” command. Is it up? Is it a trunk link to another switch?
If it’s a trunk link to another switch, you’ll need to locate the other switch and repeat the process until you find your destination device. Once again, it’s extremely helpful to have good descriptions on your interfaces. Is this the right interface for this device?
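The hop-by-hop search described above is really a small loop: look up the MAC, and if the port is a trunk, move to the switch at the other end and look again. Here is a rough Python sketch of that logic. Everything in it is hypothetical: the switch names, ports, and the two dictionaries stand in for what you'd gather from “show mac address-table” on each switch and from your cabling spreadsheet.

```python
from typing import Dict, Optional, Tuple

# Hypothetical data: per-switch MAC tables and known trunk links.
MAC_TABLES: Dict[str, Dict[str, str]] = {
    "Switch1": {"20c9.d0ac.22a1": "Gi0/1"},
    "Switch2": {"20c9.d0ac.22a1": "Fa0/19"},
}
TRUNKS: Dict[Tuple[str, str], str] = {
    ("Switch1", "Gi0/1"): "Switch2",  # Switch1 Gi0/1 is a trunk to Switch2
}

def trace_mac(start: str, mac: str) -> Optional[Tuple[str, str]]:
    """Follow trunk links from start until the MAC shows up on a non-trunk port."""
    switch = start
    visited = set()
    while switch not in visited:
        visited.add(switch)
        port = MAC_TABLES.get(switch, {}).get(mac)
        if port is None:
            return None  # this switch has not seen the MAC
        next_switch = TRUNKS.get((switch, port))
        if next_switch is None:
            return switch, port  # not a trunk, so the device is here
        switch = next_switch
    return None  # we came back to a switch we already checked

print(trace_mac("Switch1", "20c9.d0ac.22a1"))
```

The visited set is there so that inconsistent data can't send the search around in circles.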
I have mixed feelings about Cisco Discovery Protocol (CDP). It’s very useful for figuring out where things are connected.
Router1#show cdp neighbors
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
                  S - Switch, H - Host, I - IGMP, r - Repeater
Device ID        Local Intrfce     Holdtme    Capability  Platform  Port ID
Router2          Ser 0/0            179          R        2621      Ser 0/1
Switch1          Fas 1/0            152         T S       WS-C2924  2/2
Router1#
The above output shows two other Cisco devices on this network, allowing you to easily map out your trunk links and other links between Cisco devices.
The problem with CDP is that it’s almost too useful. All of that information is advertised in clear text out every port where CDP is enabled. If I’m connected to this network, I can quickly learn the name and management IP address of the switch. Even if I’m not connected to the network, if I have a piece of malware that’s running on your PC, I have the same information. For somebody with bad intentions, this is a lot of information.
For this reason, I often disable CDP. And then I miss it.
Some physical connectivity causes of intermittent trouble
The hardest problems to troubleshoot are the ones where everything seems to be okay when you go looking for it, but then they come back later. There are many reasons for intermittent problems, but since this article is mostly about physical connectivity, let’s look at those related to physical issues.
Sometimes a physical cable or an interface will stop either sending or receiving data. This is most common on fibre optic links, where one strand of fibre transmits and a separate strand receives. One of the devices thinks the link is fine, but the traffic in one direction is lost.
This is particularly bad on switch-to-switch links. Switches use a protocol called spanning tree to eliminate loops. If you have a unidirectional link between two switches, switch A will think the link is up and switch B will think it’s down. Switch A will not see any spanning tree packets (BPDUs) coming from switch B, so it has no reason to block the port and spanning tree will keep forwarding on the link. But if there’s another link from switch B back to switch A, we’ve got a loop.
Cisco has a feature called Unidirectional Link Detection (UDLD). Enable this feature on fibre optic interfaces so that a one-way link is detected and the port can be disabled before it causes a loop.
Loops usually cause high CPU utilization on switches (“show process cpu”). Also, because most of the packets involved in a loop will be broadcasts, and because broadcasts are sent out to all interfaces on a common VLAN, the “show interface” command will show very high values for the five-minute input and output rates on many interfaces. Look in particular at the “broadcast” counters in the “show interface” output.
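To put a number on the broadcast counters, you can compare them with the total input packets. The Python sketch below assumes the counter wording found in Cisco “show interface” output (“packets input” and “Received N broadcasts”). The sample numbers are invented for illustration, and I'm not suggesting any particular threshold. The counters are cumulative since they were last cleared, so treat the result as a rough indicator only.

```python
import re
from typing import Optional

def broadcast_share(show_interface_output: str) -> Optional[float]:
    """Return broadcasts as a fraction of input packets, or None if counters are missing."""
    packets = re.search(r"(\d+) packets input", show_interface_output)
    broadcasts = re.search(r"Received (\d+) broadcasts", show_interface_output)
    if packets is None or broadcasts is None:
        return None
    total = int(packets.group(1))
    if total == 0:
        return 0.0  # idle port, nothing to compare
    return int(broadcasts.group(1)) / total

# Invented counters for illustration only.
sample = """\
     1000 packets input, 64000 bytes, 0 no buffer
     Received 900 broadcasts (0 multicasts)
"""

print(broadcast_share(sample))
```

A port where nearly every input packet is a broadcast, on several interfaces at once, fits the loop pattern described above.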
The biggest problem with real loops is that the entire network can become unusable, a consequence of which is that you can’t log into your switches to figure out what’s wrong.
Faulty cabling or ports
The most common reason for an intermittent physical fault on a switched Ethernet network is a flakey connection. Sometimes the port on the switch is bad. Sometimes a patch cable is bad. Sometimes a cable running to the destination device is bad. Troubleshooting these problems starts with isolating the problem to a single device.
If you have a single device that appears to be consistently misbehaving, one of the things to look for is whether the interface state has been changing.
Switch# show interfaces gigabitethernet1/0/2
GigabitEthernet1/0/2 is down, line protocol is down (notconnect)
  Hardware is Gigabit Ethernet, address is 2037.064a.0b02 (bia 2037.064a.0b02)
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Auto-duplex, Auto-speed, media type is 10/100/1000BaseTX
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input never, output never, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 0 bits/sec, 0 packets/sec
     0 packets input, 0 bytes, 0 no buffer
     Received 0 broadcasts (0 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     0 packets output, 0 bytes, 0 underruns
     0 output errors, 0 collisions, 1 interface resets
     0 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out
Near the bottom of this output, you can find a counter labelled “interface resets.” Usually this will be a small number, as in this example. If it’s a big number, wait a few seconds, then run the command again. Did it increment? A counter that keeps climbing between runs points to a flakey physical connection.
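The run-it-twice check above is simple to script if you capture the output each time. This Python sketch pulls the “interface resets” counter out of two captures and compares them. The first sample line comes from the output above. The second is invented to show a counter that has moved. The function names are hypothetical.

```python
import re

def interface_resets(show_interface_output: str) -> int:
    """Extract the 'interface resets' counter from 'show interface' output."""
    match = re.search(r"(\d+) interface resets", show_interface_output)
    if match is None:
        raise ValueError("no 'interface resets' counter found")
    return int(match.group(1))

def resets_increased(first_sample: str, second_sample: str) -> bool:
    """Return True if the counter went up between the two captures."""
    return interface_resets(second_sample) > interface_resets(first_sample)

first = "0 output errors, 0 collisions, 1 interface resets"
second = "0 output errors, 0 collisions, 3 interface resets"  # invented second capture

print(interface_resets(first))
print(resets_increased(first, second))
```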
Try changing the switch port. Does the problem go away or does it move with the cable? If the problem doesn’t go away, then chances are the switch port was fine and the cabling is bad. Try changing the patch cable. Swap out elements one by one until the problem goes away.
Ultimately many physical troubleshooting exercises come down to the process of swapping out elements until the problem goes away. The key is to narrow the problem down so you’re swapping as few elements as possible. Otherwise it could take a while.