We introduce redundancy into networks to improve reliability. The idea is that if one device fails, another can automatically take over. By adding a little bit of complexity, we try to reduce the probability that a failure will take the network down.

But complexity is also an enemy to reliability. The more complex something is, the harder it is to understand, the greater the chance of human error, and the greater the chance of a software bug causing a new failure mode.

So, when designing a network, it’s important to balance redundancy against complexity.

The math of network redundancy

First, a little math. The probability of any one device or component failing always needs to be considered as a probability over time, because when something fails, fixing it will take time.

If I have a redundant path, I want to make sure the second path doesn’t fail before I’ve had time to fix the first one. You calculate the probability of both paths being down at once by multiplying the probability of the first failure by the probability of the second.

In many cases, you’ll have a quoted service level agreement (SLA) like 99.9%. This is the fraction of the time the service will be available. You can think of this number as the probability of “not failure.” Subtracting the number from 100% gives you the probability of failure.

For example, suppose we have two redundant services, both with a 99.9% service level. Each has a failure probability of 0.1% (or 0.001). The probability of both failing simultaneously is 0.001 x 0.001 = 0.000001. We can convert this back into an SLA figure by subtracting it from 1 and expressing it as a percentage: 99.9999%.

Notice there’s a shortcut: just count the 9s. Both SLAs in this example were three 9s: 99.9%. Multiplying the two failure probabilities adds the 9s together, so the aggregate SLA is six 9s: 99.9999%.
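Here’s a minimal Python sketch of that arithmetic. The function names are my own for illustration, and the calculation assumes the failures are fully independent. It uses Decimal so the 9s come out exactly instead of being blurred by floating-point rounding.

```python
from decimal import Decimal

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def combined_availability(*availabilities: Decimal) -> Decimal:
    """Availability of redundant services, assuming independent failures."""
    all_fail = Decimal(1)
    for availability in availabilities:
        # Probability that this particular service is down.
        all_fail *= Decimal(1) - availability
    return Decimal(1) - all_fail


def downtime_minutes_per_year(availability: Decimal) -> Decimal:
    """Expected minutes of downtime per year at a given availability."""
    return (Decimal(1) - availability) * MINUTES_PER_YEAR


single = Decimal("0.999")
pair = combined_availability(single, single)

print("one service: ", single * 100, "% available,",
      downtime_minutes_per_year(single), "min/year down")
print("two services:", pair * 100, "% available,",
      downtime_minutes_per_year(pair), "min/year down")
```

Seen as downtime, three 9s is a little under nine hours a year, while six 9s is roughly half a minute.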

The hidden assumption

But there’s a critical hidden assumption in this math: the two failure probabilities must be completely independent of one another. That’s seldom really true in IT.

For example, two circuits could both fail simultaneously because of a car hitting a telephone pole. Or two circuits into the same MPLS network could both become unavailable at the same time because of a disaster in the core of the carrier’s network.
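To see how much a shared risk matters, here’s a rough Python sketch. The model is my own simplification: a single shared event (the telephone pole, the carrier core) takes both circuits down together, and otherwise each circuit fails independently. The 0.05% shared-risk figure is purely hypothetical.

```python
from decimal import Decimal


def availability_with_shared_risk(p_each: Decimal, p_shared: Decimal) -> Decimal:
    """Availability of two redundant circuits that share one common risk.

    Simplified model (an assumption for illustration): the shared event takes
    both circuits down together; otherwise each circuit fails independently.
    """
    both_down = p_shared + (Decimal(1) - p_shared) * p_each * p_each
    return Decimal(1) - both_down


p_each = Decimal("0.001")     # each circuit alone: 99.9% available
p_shared = Decimal("0.0005")  # hypothetical shared risk: same pole, same core

print("fully independent:", availability_with_shared_risk(p_each, Decimal("0")))
print("with shared risk: ", availability_with_shared_risk(p_each, p_shared))
```

Under these assumptions, even a small shared risk dominates the result. The pair drops from six 9s back to roughly three, which is about what a single circuit gave you in the first place.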

But the single biggest reason for failures in IT environments is human error. Redundancy tends to increase complexity, and complexity increases the chances of human error. So just increasing the number of redundant systems doesn’t necessarily increase your overall reliability.

And this is the key to using redundancy effectively: you need to keep the complexity under control.

Tips for achieving minimal complexity

Here’s a list of things to keep in mind for implementing network redundancy while minimizing complexity.

  • Identical systems with identical connections
    I like to provide redundancy by implementing exact duplicate systems in key spots in the network. For example, a core switch will be two identical switches. When I say identical, I mean they should be the same model, running the same software, and they should have the same connections as much as possible.

    The easiest way to do this with switches is to use stackable switches. Then there’s really nothing to do—connect up the stacking cable and you have redundancy out of the box.

  • Simple redundancy protocols
    There are a lot of ways to implement network redundancy. The most reliable ones involve the simplest configuration on the fewest devices.

    For example, if I need a highly available firewall, I’ll implement a pair of devices. And I’ll always use the vendor’s fail-over mechanisms. Then I don’t need to worry about making the firewall take part in any routing protocols. Unless there’s a compelling reason for the firewall to run a routing protocol, it only introduces unnecessary complexity.

    Use the simplest configuration that meets the requirements at hand.

  • Keep everything parallel
    One thing that often trips people up is how to connect successive layers of redundant devices. The trick is to keep it all parallel. Create an A path and a B path with a cross-over connection at each layer. The idea is that any one device can fail completely without disrupting the end-to-end path.

    For example, suppose I have a pair of access switches, a pair of core switches, and a pair of firewalls. I’d connect access switch A to core switch A, which also supports firewall A. Similarly, access switch B connects to core switch B, which connects to firewall B. I’d also connect the two access switches to one another and the two core switches to one another.

    In this example, you may be tempted to further connect access switch A to core switch B and access switch B to core switch A. It’s certainly a common configuration, but as soon as you do this, you need to know what you’re doing in terms of link aggregation and spanning tree. That could add considerable extra complexity if you’re new to network design.

  • Never do more than you need to do
    As the previous example suggests, it’s easy to go further in implementing redundancy than is absolutely required. In many cases the extra redundancy is warranted and could provide additional functionality. But carefully consider every piece of equipment, every link, and every protocol. For each one, ask whether it’s providing enough additional functionality to warrant the additional complexity.

  • Cookie cutters
    Finally, it’s extremely useful to follow a standard model when implementing your networks. If you have multiple data centers, make them as nearly identical as possible in terms of topology.

    Similarly, make your access switches as nearly identical as possible. Use common VLAN assignments everywhere, and have a common IP addressing scheme that works everywhere. Make the default gateway on every segment follow a simple common rule such as the first or the last IP address. If you use redundancy protocols like HSRP, use them everywhere, and configure them the same way everywhere.

    All this similarity helps limit the possibility of human error. Maybe the new engineer has never looked at this particular device before. But if it’s exactly the same as every other device performing a similar function, then it’s much less likely that he or she will miss some obscure bit of protocol magic that was implemented on this device and only this device.
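As one small illustration of the cookie-cutter idea, the “first or last IP address” gateway rule is easy to express in code, which also makes it easy to audit. This is only a sketch using Python’s standard ipaddress module. The function name and the example subnets are hypothetical.

```python
import ipaddress


def default_gateway(subnet: str, rule: str = "first") -> ipaddress.IPv4Address:
    """Return the gateway address for an IPv4 subnet under a site-wide rule."""
    network = ipaddress.IPv4Network(subnet)
    if network.prefixlen > 30:
        raise ValueError(f"{subnet} has no room for a gateway plus hosts")
    if rule == "first":
        return network.network_address + 1    # first usable address
    if rule == "last":
        return network.broadcast_address - 1  # last usable address
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical segments; every one follows the same rule.
for subnet in ("10.1.10.0/24", "10.1.20.0/23"):
    print(subnet, "->", default_gateway(subnet, "first"))
```

With a rule like this, anyone can work out the gateway for a segment from its subnet alone, without looking at the device.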

Designing for redundancy

There are useful redundancy protocols at many different OSI layers. The first thing to think about is what happens at each layer if you lose any individual link or piece of equipment.

If you’re new to this, I suggest creating detailed Layer 1-2 and Layer 3 network diagrams showing every box and every link. Put your pencil or your mouse on each line or box in succession and ask these questions for each element:

  • What happens at Layer 1 if this box or link goes down? Do you still have connectivity?
  • What happens at Layer 2? Do you still have continuity of all VLANs throughout the network?
  • What happens at Layer 3? Do you still have a default gateway on each segment?
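The pencil exercise can also be automated for the Layer 1 question. The sketch below models the A/B example from the tips section (a pair of access switches, a pair of core switches, and a pair of firewalls) as a simple graph, then fails each device and each link in turn. The device names are hypothetical labels. It also assumes that end hosts can use either access switch and that either firewall can carry the traffic, and it says nothing about Layer 2 or Layer 3 behavior.

```python
from collections import deque

# Links of the A/B example: an A path and a B path, plus a cross-over
# connection at the access and core layers.
LINKS = [
    ("access-a", "core-a"), ("core-a", "fw-a"),
    ("access-b", "core-b"), ("core-b", "fw-b"),
    ("access-a", "access-b"), ("core-a", "core-b"),
]
ACCESS = {"access-a", "access-b"}
FIREWALLS = {"fw-a", "fw-b"}


def reachable(start, links, failed_device=None):
    """Breadth-first search over the links, skipping any failed device."""
    neighbors = {}
    for a, b in links:
        if failed_device in (a, b):
            continue  # links attached to a failed device are unusable
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbors.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def survives(links, failed_device=None):
    """True if a surviving access switch still reaches a surviving firewall."""
    for switch in ACCESS - {failed_device}:
        if reachable(switch, links, failed_device) & (FIREWALLS - {failed_device}):
            return True
    return False


devices = sorted({device for link in LINKS for device in link})
for device in devices:
    print(f"device {device} down: path survives = {survives(LINKS, device)}")
for link in LINKS:
    remaining = [other for other in LINKS if other != link]
    print(f"link {link} down: path survives = {survives(remaining)}")
```

Removing the two cross-over links from LINKS and rerunning is a quick way to see why they matter: a core switch failure on one side and a firewall failure on the other would then have no path around each other.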

There are a lot of different redundancy protocols around, not all of which are equally robust. You’ll need to choose appropriate protocols for your equipment and network, but here are the ones I generally use.

At Layers 1 and 2, I like to use Link Aggregation Control Protocol (LACP) for link redundancy. This includes multi-chassis LACP variations like Cisco’s virtual Port Channel (vPC) technology, available on all Nexus switches.

Note, however, that most multi-chassis link aggregation protocols have serious limitations. HP’s Distributed Trunking, for example, is best used for providing redundant connectivity for servers, and can have strange behavior when interconnecting pairs of switches.

The other important Layer 2 protocol to use is Spanning Tree Protocol (STP). I prefer the modern fast-converging STP variants, MSTP and RSTP. (I’ve written about spanning tree before.)

At Layer 3, your redundancy mechanisms need to make the routing functions available when a device fails. The choice of protocol here depends on many factors.

If the devices on this network segment are mostly end devices, such as servers or workstations, then I prefer to use a protocol that will allow the default gateway function to jump to a backup device in case the primary device fails. The best choices for this are Cisco’s proprietary Hot Standby Router Protocol (HSRP) or the open standard Virtual Router Redundancy Protocol (VRRP).

If the segment is being used primarily to interconnect network devices, then it might make more sense to use a dynamic routing protocol such as OSPF, EIGRP, or BGP. I don’t advise using the older RIP because it has serious limitations in both convergence time and network size.

However, I strongly advise against using both types of protocols on the same segment, such as deploying HSRP together with OSPF. Doing this can lead to network instability, particularly when dealing with multicast traffic.

For physical box redundancy, the exact technology will dictate the best choice. For firewalls, which need to maintain massive tables of state information for every connection, there are no viable open standards. In these cases, you really need to use the vendor’s proprietary hardware redundancy mechanisms.

Similarly, stackable switches are always very simple to deploy, usually requiring almost no special configuration to achieve box-level redundancy. The only thing to bear in mind is that you need to be careful about how you distribute connections among the stack members.

For other devices like switches and routers, it makes sense to combine a Layer 1-2 and a Layer 3 protocol from the ones discussed above. Be careful, though. Make sure the same device is the “master” at all layers. For example, at any moment your Layer 3 default gateway should be the same physical device as the spanning tree root bridge.
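A quick way to check this alignment on paper is to work out who should win each election from the configured priorities. The sketch below does that for a hypothetical core pair. The device names, addresses, and priority values are made up, and the model is deliberately simplified: it assumes a single spanning tree instance, HSRP preemption enabled, and both devices up. The election rules themselves are the standard ones: the lowest bridge ID wins the spanning tree root role, and the highest priority wins the HSRP active role.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    stp_priority: int   # bridge priority: lower wins the root election
    mac: str            # STP tie-breaker: lower MAC address wins
    hsrp_priority: int  # HSRP priority: higher wins the active role
    ip: str             # HSRP tie-breaker: higher interface IP wins


def stp_root(devices):
    """Device with the lowest bridge ID (priority, then MAC address)."""
    return min(devices,
               key=lambda d: (d.stp_priority, int(d.mac.replace(":", ""), 16)))


def hsrp_active(devices):
    """Device with the highest HSRP priority (then highest IP address)."""
    return max(devices,
               key=lambda d: (d.hsrp_priority, ipaddress.IPv4Address(d.ip)))


def roles_aligned(devices):
    """True if the same physical device holds both roles."""
    return stp_root(devices).name == hsrp_active(devices).name


# Hypothetical core pair, intended to make core-a the master at both layers.
core_pair = [
    Device("core-a", 4096, "00:11:22:33:44:01", 110, "10.1.10.2"),
    Device("core-b", 8192, "00:11:22:33:44:02", 100, "10.1.10.3"),
]
print("STP root:   ", stp_root(core_pair).name)
print("HSRP active:", hsrp_active(core_pair).name)
print("aligned:    ", roles_aligned(core_pair))
```

If the two spanning tree priorities were swapped, core-b would become the root while core-a stayed the gateway, and traffic to the gateway would take an extra hop across the link between the two switches.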

In all cases, make sure you thoroughly understand the implementation guidelines for each technology you’ll be using and follow them carefully. If you don’t understand a technology, trying it out in a production network can be career-limiting.

Maximum availability with minimum complexity

The goal is maximum availability with minimum complexity. So it’s vitally important to keep the configuration simple. Don’t implement multiple redundancy mechanisms that are trying to accomplish the same logical function.

When it comes to routing protocols in particular, think about whether you can get away with a static route pointing to an HSRP default gateway. Routing protocols have to distribute a lot of information among a lot of devices, and that always takes time. HSRP and VRRP are both faster and simpler, so you should use them if you can.

If you have stacked switches, think about what happens to upstream and downstream connections if one stack member fails. Where possible, you should distribute these links among the various stack members.
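As a rough sketch of that idea, the Python below spreads a set of links across stack members round-robin and then checks what is left when one member fails. The link and member names are hypothetical, and a real design would also consider which links belong to the same aggregated bundle.

```python
def distribute_links(link_names, stack_members):
    """Assign links to stack members in round-robin order."""
    assignment = {member: [] for member in stack_members}
    for index, link in enumerate(link_names):
        member = stack_members[index % len(stack_members)]
        assignment[member].append(link)
    return assignment


def surviving_links(assignment, failed_member):
    """Links that remain up when one stack member fails."""
    return [link
            for member, links in assignment.items()
            if member != failed_member
            for link in links]


uplinks = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]
stack = ["member-1", "member-2"]
plan = distribute_links(uplinks, stack)

for member in stack:
    print(f"{member} down: still up = {surviving_links(plan, member)}")
```

If every uplink landed on the same member, the list of surviving links would be empty when that member failed, which is exactly the situation to avoid.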

Above all, remember that building a real-world network is not a test where you have to demonstrate your understanding of every configuration option. Points won’t be deducted for using static routes and trivial default configurations. Keep it simple.