Routing problems tend to emerge when you’re first setting up a new piece of network equipment, and when something has failed. Usually routing problems are caused by some sort of configuration or design error. Troubleshooting routing problems is tricky because the usual tools like ping and traceroute don’t always tell you what you need to know.

Let’s start with the basics of how a packet is routed through a network, which illuminates critical subtleties that are useful when troubleshooting.

The originating device puts three important parameters into the IP packet header:

  • The source IP address, which is the address of the device itself
  • The destination IP address, which is where the packet is going
  • The IP protocol, such as UDP or TCP or ICMP

In the case of UDP and TCP, there are two additional numbers, both of which are important: the source and destination port numbers. The destination IP address is what we normally think of in routing, but actually the network can route the packet using any combination of these values.

Another parameter called time to live (TTL) governs how far away the destination can be. The name is deceptive because it doesn’t really have anything to do with time. TTL is a hop counter that keeps track of how many times the packet has been forwarded, and is used to prevent loops.
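To make these header fields concrete, here is a minimal Python sketch that packs and unpacks a bare 20-byte IPv4 header. The destination address 192.0.2.10 and the helper names are assumptions for illustration only, and the checksum is left at zero rather than computed.

```python
import socket
import struct

# Layout of a 20-byte IPv4 header without options.
IPV4_FORMAT = "!BBHHHBBH4s4s"

def build_ipv4_header(src, dst, protocol, ttl=64, payload_len=0):
    """Pack a minimal IPv4 header (checksum left at 0 for simplicity)."""
    version_ihl = (4 << 4) | 5           # version 4, header length 5 words
    return struct.pack(
        IPV4_FORMAT,
        version_ihl, 0, 20 + payload_len,
        0, 0,                            # identification, flags/fragment offset
        ttl, protocol, 0,                # TTL, protocol number, checksum
        socket.inet_aton(src), socket.inet_aton(dst),
    )

def parse_ipv4_header(header):
    """Pull out the fields that matter for routing decisions."""
    fields = struct.unpack(IPV4_FORMAT, header[:20])
    return {
        "ttl": fields[5],
        "protocol": fields[6],
        "src": socket.inet_ntoa(fields[8]),
        "dst": socket.inet_ntoa(fields[9]),
    }

# 192.0.2.10 is a documentation address used as a hypothetical destination.
header = build_ipv4_header("10.10.80.2", "192.0.2.10", socket.IPPROTO_UDP)
info = parse_ipv4_header(header)
print(info)
```

The port numbers are not in this header at all. They live in the UDP or TCP header that follows it.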

The first task of the originating device is to look up the destination address in its own internal routing table. On Windows, use the “route print” command.

C:\Users\Kevin>route print
[…]
===========================================================================
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0       10.10.80.1       10.10.80.2     25
       10.10.80.0    255.255.255.0         On-link        10.10.80.2    281
       10.10.80.2  255.255.255.255         On-link        10.10.80.2    281
     10.10.80.255  255.255.255.255         On-link        10.10.80.2    281
        127.0.0.0        255.0.0.0         On-link         127.0.0.1    306
        127.0.0.1  255.255.255.255         On-link         127.0.0.1    306
  127.255.255.255  255.255.255.255         On-link         127.0.0.1    306
        224.0.0.0        240.0.0.0         On-link         127.0.0.1    306
        224.0.0.0        240.0.0.0         On-link        10.10.80.2    281
  255.255.255.255  255.255.255.255         On-link         127.0.0.1    306
  255.255.255.255  255.255.255.255         On-link        10.10.80.2    281
=========================================================================== 

This example shows a lot of destination networks, but really only two of them matter. The first line is the default route. Network 0.0.0.0 with mask 0.0.0.0 matches any destination. This default route points to a next hop device, my router, 10.10.80.1.

The other entry in this routing table that matters is the second one, for 10.10.80.0 with a mask of 255.255.255.0. This matches any destination between 10.10.80.0 and 10.10.80.255, my local network segment, which includes my router.
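The lookup itself is a longest-prefix match: of all the entries that contain the destination, the most specific one wins. A rough Python sketch, using a few rows copied from the table above (the function names are my own, and a real stack also weighs metrics when prefixes tie):

```python
import ipaddress

# (network, netmask, gateway) rows copied from the "route print" output.
ROUTES = [
    ("0.0.0.0", "0.0.0.0", "10.10.80.1"),
    ("10.10.80.0", "255.255.255.0", "On-link"),
    ("10.10.80.2", "255.255.255.255", "On-link"),
    ("127.0.0.0", "255.0.0.0", "On-link"),
]

def prefix_length(mask):
    """Count the one bits in a dotted-quad netmask."""
    return bin(int(ipaddress.ip_address(mask))).count("1")

def lookup(destination, routes=ROUTES):
    """Return (network, gateway) for the most specific matching route."""
    dest = ipaddress.ip_address(destination)
    best = None
    for network, mask, gateway in routes:
        net = ipaddress.ip_network(f"{network}/{prefix_length(mask)}")
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway)
    return best

print(lookup("10.10.80.1"))   # matched by the local /24 segment
print(lookup("8.8.8.8"))      # only the default route matches
```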

Based on this table, my PC knows to forward this packet to my router. To do so, it encapsulates the IP packet in an Ethernet frame with the router’s Ethernet MAC address in the destination field and its own Ethernet MAC address in the source field, which identifies the sender on the local segment.

The router strips off the Ethernet frame and looks in its own routing table to know how to reach the destination IP address.

Router>show ip route
Codes: C - connected, S - static, I - IGRP, R - RIP, M - mobile, B - BGP
D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2, E - EGP
i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, * - candidate default
U - per-user static route, o - ODR

10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
C 10.10.80.0/24 is directly connected, Ethernet0
C 10.10.1.0/24 is directly connected, Ethernet1
C 10.25.8.1/32 is directly connected, Loopback0
S 0.0.0.0/0 [1/0] via 10.10.1.4
[…]

This routing table uses CIDR “slash” notation instead of separate network and mask, but it conveys similar information to the PC’s “route print” command. Look for the entry that matches the destination IP address. Let’s assume it’s the default route, 0.0.0.0/0. The router sees that it points to a “next hop”, 10.10.1.4, which is another router.

The router then creates a new Ethernet frame using the MAC address of this next hop router for the destination address, and wraps it around the original IP packet. The only important change it makes to that original packet is to decrease the TTL value by one, recalculating the header checksum to match.
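The split between the two layers can be sketched in a few lines of Python. The MAC addresses and class names below are made up for illustration. The point is that the frame is rebuilt at every hop while the IP addresses inside it stay the same.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class IPPacket:
    src: str
    dst: str
    ttl: int

@dataclass(frozen=True)
class EthernetFrame:
    src_mac: str
    dst_mac: str
    payload: IPPacket

def forward(frame, router_mac, next_hop_mac):
    """Strip the old frame and wrap the packet in a new one for the next hop."""
    packet = frame.payload
    if packet.ttl <= 1:
        return None                      # TTL would hit zero, so drop it
    # A real router also recalculates the IP header checksum at this point.
    return EthernetFrame(router_mac, next_hop_mac,
                         replace(packet, ttl=packet.ttl - 1))

# Hypothetical MAC addresses for the PC, the first router and the next hop.
PC_MAC, ROUTER_MAC, NEXT_HOP_MAC = "aa:aa:aa:00:00:01", "bb:bb:bb:00:00:01", "cc:cc:cc:00:00:01"

first = EthernetFrame(PC_MAC, ROUTER_MAC, IPPacket("10.10.80.2", "192.0.2.10", 128))
second = forward(first, ROUTER_MAC, NEXT_HOP_MAC)
print(second)
```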

The whole process is repeated until the packet is delivered to the destination.

The tools

The standard go-to tools for troubleshooting routing problems are ping and traceroute.

Ping is a very simple-minded tool. It sends an Internet Control Message Protocol (ICMP) “echo request” packet to the destination device, which sends back an “echo reply” packet. ICMP is a special IP protocol, different from either TCP or UDP. ICMP packets don’t contain source or destination ports, just a “type,” such as “echo request” or “echo reply.” That’s it. If the request can get all the way through to the destination and the reply can get all the way back, then you know you have Layer 3 connectivity.
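To show how little there is inside a ping, here is a sketch that builds the ICMP message by hand in Python. The type codes (8 for an echo request, 0 for the answer) and the checksum method are standard, while the identifier, sequence number and payload are arbitrary values I picked. It only constructs the bytes and doesn’t send anything.

```python
import struct

ICMP_ECHO_REQUEST = 8
ICMP_ECHO_REPLY = 0

def internet_checksum(data):
    """One's-complement sum of 16-bit words, as used by ICMP."""
    if len(data) % 2:
        data += b"\x00"                  # pad to a whole number of words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(identifier, sequence, payload=b"ping"):
    """Type, code, checksum, identifier, sequence number, then the payload."""
    unchecked = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    checksum = internet_checksum(unchecked + payload)
    return struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum,
                       identifier, sequence) + payload

packet = build_echo_request(identifier=0x1234, sequence=1)
print(packet.hex())
```

Notice that there is no port field anywhere in the message, which is one reason a filter can treat ping differently from your application traffic.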

The problem with ping should be obvious from its description: It tells you nothing at all if you can’t reach the destination. This is where traceroute comes in. Traceroute also sends packets (either UDP or ICMP depending on the implementation) to the destination IP address and looks for a response, but it actually tries several times while manipulating the TTL field that I mentioned earlier.

The first time it tries, traceroute sends the packet with a TTL value of 1. The first router decrements the TTL to 0, and no router is supposed to forward a packet whose TTL has reached zero. So the router drops the packet and sends back a special “TTL exceeded” type ICMP packet to the source. Traceroute reports the source IP address of that ICMP packet. Now you know the first hop. It will generally send three probes for each hop, which shows whether the route and its response times are stable.

Then traceroute increments the initial TTL value and sends another packet. This time the first router sees a TTL value of 2, decrements it to 1 and forwards it to the next hop router, which decrements it to 0, drops it and sends back an ICMP message. Traceroute displays the IP address of that router. This process repeats with initial TTL values of 3, 4, 5, and so on until the destination is reached.
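The mechanism is easy to simulate without touching the network. In this Python sketch the path is just a list I made up (two routers from the earlier examples plus a hypothetical destination), and probing starts with an initial TTL of 1, as real traceroute implementations do. Each router decrements the TTL and answers if it reaches zero.

```python
def probe(path, ttl):
    """Send one simulated probe; return (responder, reached_destination)."""
    for router in path[:-1]:             # every entry except the last is a router
        ttl -= 1
        if ttl == 0:
            return router, False         # router replies with "TTL exceeded"
    return path[-1], True                # the probe made it to the destination

def traceroute(path, max_hops=30):
    """Raise the initial TTL one step at a time until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, done = probe(path, ttl)
        hops.append((ttl, responder))
        if done:
            break
    return hops

# Hypothetical path: two routers, then the destination host.
PATH = ["10.10.80.1", "10.10.1.4", "192.0.2.10"]
for ttl, address in traceroute(PATH):
    print(ttl, address)
```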

Traceroute will often show you several hops, followed by line after line of “* * *”, which means that it didn’t get back the “TTL exceeded” message. This usually means that the last device you saw explicitly listed is the last one that has a good route to the destination. Whoever it forwarded that packet to didn’t know what to do with it.

But it’s also possible that the packet was forwarded, but you just didn’t get the “TTL exceeded” message. Sometimes firewalls in particular will refuse to send this message. And sometimes firewalls will actively block these packets from all downstream devices. So it’s not conclusive, but it gives you an idea of where to start looking for trouble.

The other interesting thing you sometimes see in a traceroute session is multiple next-hop IP addresses for the same TTL value. This tells you that there are actually multiple paths to the destination, all with the same routing cost. This is only a problem if there are stateful firewalls in the path and the return traffic comes back through a different firewall than the one the original packet passed through. A firewall will generally object to forwarding response packets back to a source if it didn’t see the original packet going the other way. To the firewall, this looks like a protocol violation, so it will usually drop the unexpected packet.

Routing loops

The other thing that traceroute will sometimes show you is a loop. Somewhere along the path you’ll see an IP address that you’d already seen. That is, you’ll see a path that goes from router A to B, C, D, C, D, C, and so on. This tells you that router C is forwarding the packet to router D, which forwards it back to C.
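If you collect traceroute output in a script, spotting the repeat is simple to automate. This is a rough Python sketch that assumes you have already reduced the output to one address per hop, with “*” for hops that didn’t answer. The router names are placeholders.

```python
def find_loop(hops):
    """Return the repeating run of addresses, or None if no address repeats."""
    first_seen = {}
    for index, hop in enumerate(hops):
        if hop == "*":
            continue                     # unanswered probes tell us nothing
        if hop in first_seen:
            return hops[first_seen[hop]:index]
        first_seen[hop] = index
    return None

looping = ["A", "B", "C", "D", "C", "D", "C"]
healthy = ["A", "B", "C", "D", "E"]
print(find_loop(looping))
print(find_loop(healthy))
```

Treat a hit as a hint rather than proof, since a router with several interfaces can legitimately answer from different addresses.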

This is actually the reason the TTL field exists. The source device sets an initial TTL, typically 64 or 128 depending on the operating system; 255 is the largest value the 8-bit field can hold. In a loop, each pass through a router decrements the TTL until it reaches 0 and the packet is dropped. There are no infinite loops in IP, but it’s still a bad thing because your packets aren’t getting through and congestion problems could result.

If you see a loop, you’ll need to figure out what the path is supposed to be and to adjust the routing tables of the looping devices. Typically you’ll see loops in situations where a dynamic routing table is in conflict with a static route on one or both of the routers in question. This could happen, for example, if you have a static default route on one of the devices pointing to the other one. Then if the more specific route to your destination disappears for any reason, the router will use the default route and send the packet back where it came from.
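Here is that failure mode as a small Python simulation. The router names, the 10.20.0.0/16 prefix and the tiny TTL are all invented for the example. Router D normally has a specific route toward E plus a static default pointing back at C. When the specific route is removed, the default takes over and the packet bounces until the TTL runs out.

```python
import ipaddress

def next_hop(table, destination):
    """Longest-prefix match over a {prefix: next_hop} table."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, hop in table.items():
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None

def trace(tables, start, destination, ttl=8):
    """Follow the packet router by router; return (path, delivered)."""
    path, router = [start], start
    while ttl > 0:
        hop = next_hop(tables[router], destination)
        if hop is None or hop == "deliver":
            return path, hop == "deliver"
        path.append(hop)
        router = hop
        ttl -= 1
    return path, False                   # TTL expired before delivery

# Hypothetical topology: C -> D -> E, with a static default on D back to C.
TABLES = {
    "C": {"10.20.0.0/16": "D"},
    "D": {"10.20.0.0/16": "E", "0.0.0.0/0": "C"},
    "E": {"10.20.0.0/16": "deliver"},
}
BROKEN = dict(TABLES)
BROKEN["D"] = {"0.0.0.0/0": "C"}         # the specific route has disappeared

print(trace(TABLES, "C", "10.20.1.5"))
print(trace(BROKEN, "C", "10.20.1.5"))
```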

Protocol filters and policy routing

Suppose ping and traceroute say everything is fine, but your application packets still aren’t getting through. This is typically due to either a filter of some kind or policy routing.

Protocol filters are also called access control lists (ACLs). You can find these filters on Cisco routers, switches, and firewalls by searching the configuration file for “access-group” commands, which apply the ACL to an interface.

An ACL can allow one type of traffic and block another type. For example, you might find that the ICMP ping packets are allowed but your application traffic is not. In this case, the routing tables will look right and the ping and traceroute tests will work, but you won’t be able to run the application.
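A toy model shows why ping can succeed while the application fails. This Python sketch evaluates rules top to bottom and stops at the first match, with a deny at the end to mirror the implicit deny that terminates a Cisco ACL. The rule set, addresses and port numbers are invented for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    action: str                     # "permit" or "deny"
    protocol: str                   # "icmp", "tcp", "udp", or "ip" for any
    source: str                     # source network in CIDR form
    destination: str                # destination network in CIDR form
    dst_port: Optional[int] = None  # None matches any port

def evaluate(rules, protocol, src, dst, dst_port=None):
    """Return the action of the first matching rule, else the implicit deny."""
    for rule in rules:
        if rule.protocol not in ("ip", protocol):
            continue
        if ipaddress.ip_address(src) not in ipaddress.ip_network(rule.source):
            continue
        if ipaddress.ip_address(dst) not in ipaddress.ip_network(rule.destination):
            continue
        if rule.dst_port is not None and rule.dst_port != dst_port:
            continue
        return rule.action
    return "deny"

# Hypothetical ACL: ping is allowed everywhere, but only web traffic on port 80.
ACL = [
    Rule("permit", "icmp", "0.0.0.0/0", "0.0.0.0/0"),
    Rule("permit", "tcp", "0.0.0.0/0", "10.20.0.0/16", 80),
]

print(evaluate(ACL, "icmp", "10.10.80.2", "10.20.1.5"))
print(evaluate(ACL, "tcp", "10.10.80.2", "10.20.1.5", 8443))
```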

Policy routing (also called policy-based routing or PBR) can cause even stranger problems if it goes awry. Policy routing means the router will override the routing table when making its forwarding decisions. Instead it might make its decisions based on the source IP address, protocol or port number. So the router could be forwarding the ping packets through one path and the application traffic a completely different way.

If you suspect that policy routing is causing your problems, the first thing to do is to look at the router configuration files for an interface configuration block that includes an “ip policy” command. This command will refer to a route map, which in turn will define how the packets are to be routed.

interface Ethernet0
 ip address 10.10.5.1 255.255.255.0
 ip policy route-map FUNKYROUTING
!
route-map FUNKYROUTING
 match ip address 100 
 set ip next-hop 10.10.6.1
!

In this example, the policy will override whatever is in the routing table for those packets that match ACL number 100, and always forward them to the specified next-hop router. The ACL could identify these packets based on source or destination addresses or ports, or any combination.
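The order of operations is what matters here: the policy is checked first, and the routing table is consulted only if nothing matches. The configuration above doesn’t show what ACL 100 contains, so the match criteria in this Python sketch (TCP port 443 from the 10.10.5.0/24 segment) are purely an assumption, as is the one-line routing table.

```python
import ipaddress

# Hypothetical routing table: everything follows the default route.
ROUTING_TABLE = {"0.0.0.0/0": "10.10.1.4"}

# Assumed stand-in for ACL 100; the real ACL is not shown in the config.
POLICY_MATCH = {"source": "10.10.5.0/24", "protocol": "tcp", "dst_port": 443}
POLICY_NEXT_HOP = "10.10.6.1"            # from "set ip next-hop"

def table_next_hop(dst):
    """Ordinary longest-prefix lookup in the routing table."""
    dest = ipaddress.ip_address(dst)
    best = None
    for prefix, hop in ROUTING_TABLE.items():
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None

def matches_policy(src, protocol, dst_port):
    return (ipaddress.ip_address(src) in ipaddress.ip_network(POLICY_MATCH["source"])
            and protocol == POLICY_MATCH["protocol"]
            and dst_port == POLICY_MATCH["dst_port"])

def forwarding_decision(src, dst, protocol, dst_port=None):
    """Policy routing wins over the routing table when the packet matches."""
    if matches_policy(src, protocol, dst_port):
        return POLICY_NEXT_HOP
    return table_next_hop(dst)

print(forwarding_decision("10.10.5.20", "192.0.2.10", "icmp"))
print(forwarding_decision("10.10.5.20", "192.0.2.10", "tcp", 443))
```

The same source and destination produce two different next hops, which is exactly the situation where a clean ping tells you nothing about the application’s path.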

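Whenever PBR is configured in your network, you need to be extremely careful about troubleshooting routing problems, because ping and traceroute may follow a different path than the traffic you’re actually trying to fix.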

VPNs

Another important place to look when troubleshooting routing problems is virtual private network (VPN) configuration. Many companies interconnect their remote offices using VPNs through the internet. Sometimes the VPN is a backup link in case a primary private circuit or MPLS service goes down, and sometimes the VPN is the only link. IPsec VPNs are typically used for interconnecting networks.

The critical thing to watch out for in the VPN configuration is the “interesting traffic list.” This is an ACL that defines what packets may use the VPN link, generally identifying both source and destination networks. Watch out for mismatches between the ACLs on the devices at the two ends of the VPN, which should normally mirror each other with source and destination swapped, and for networks that are missing from one or both of them.
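Comparing the two lists by eye gets error-prone once there are more than a handful of networks. The following Python sketch assumes you have already extracted each side’s entries as (source, destination) pairs. The networks and site names are invented, and real configurations may need extra normalization before they compare cleanly.

```python
def mirror(entries):
    """Swap source and destination, which is how the far end should see it."""
    return {(dst, src) for src, dst in entries}

def find_mismatches(local_acl, remote_acl):
    """Report entries that one end has but the other end lacks in mirrored form."""
    local, remote = set(local_acl), set(remote_acl)
    return {
        "missing_on_local": sorted(mirror(remote) - local),
        "missing_on_remote": sorted(mirror(local) - remote),
    }

# Hypothetical interesting-traffic lists as (source, destination) pairs.
site_a = [("10.10.0.0/16", "10.20.0.0/16"), ("10.10.0.0/16", "10.30.0.0/16")]
site_b = [("10.20.0.0/16", "10.10.0.0/16")]

report = find_mismatches(site_a, site_b)
print(report)
```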