Link aggregation is a way of bundling a bunch of individual (Ethernet) links together so they act like a single logical link.
If you have a switch with a whole lot of Gigabit Ethernet ports, you can connect all of them to another device that also has a bunch of ports and balance the traffic among these links to improve performance.
Another important reason for using link aggregation is to provide fast and transparent recovery in case one of the individual links fails.
Individual packets are kept intact and sent from one device to the other over one of the links. In fact, the protocol usually tries to keep whole sessions on a single link. A packet from the next conversation could go over a different link.
The idea is to achieve improved performance by transmitting several packets simultaneously down different links. But standard Ethernet link aggregation never chops up the packet and sends the bits over different links.
The official IEEE standard for link aggregation used to be called 802.3ad, but is now 802.1AX, as I will explain later. However, several vendors have also developed their own proprietary variants.
Common link aggregation terminology
A lot of potentially confusing terms appear in any discussion of link aggregation. So let’s quickly review them before digging a bit further into the technology.
- A group of ports combined together is called a link aggregation group, or LAG. Different vendors have their own terms for the concept. A LAG can also be called a port-channel, a bond, or a team.
- The rule that defines which packets are sent along which link is called the scheduling algorithm.
- The active monitoring protocol that allows devices to include or remove individual links from the LAG is called Link Aggregation Control Protocol (LACP).
The first important thing to know is that all links in a LAG must be Ethernet (10/100/1000/10G, etc.) and they must all be the same type and speed.
LACP can’t balance traffic among two Gigabit Ethernet links and a 100Mbps Ethernet link, for example. If you try, the devices will refuse to include the different link in the LAG. They might even refuse to bring up the LAG.
Further, all the links must be configured the same way. You can’t have a mix of duplex settings or different VLAN configurations or queuing features.
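To make the matching rule concrete, here is a minimal Python sketch of the kind of check a device might run before admitting ports to a LAG. The `PortConfig` fields, the port names, and the choice of the first port as the reference are all my own assumptions for illustration. Real devices apply their own rules and compare more settings than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortConfig:
    name: str
    speed_mbps: int
    duplex: str        # "full" or "half"
    vlans: frozenset   # VLAN IDs allowed on the port


def split_members(ports):
    """Return (accepted, rejected) port names, using the first port as the reference."""
    if not ports:
        return [], []
    ref = ports[0]
    accepted, rejected = [], []
    for port in ports:
        same = (port.speed_mbps, port.duplex, port.vlans) == (
            ref.speed_mbps, ref.duplex, ref.vlans)
        (accepted if same else rejected).append(port.name)
    return accepted, rejected


trunk_vlans = frozenset({10, 20})
ports = [
    PortConfig("port1", 1000, "full", trunk_vlans),
    PortConfig("port2", 1000, "full", trunk_vlans),
    PortConfig("port3", 100, "full", trunk_vlans),  # the odd one out: 100Mbps
]
accepted, rejected = split_members(ports)
print("accepted:", accepted)
print("rejected:", rejected)
```

In this toy model the 100Mbps port is simply left out. As noted above, a real device might instead refuse to bring up the LAG at all.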
You can typically put up to eight individual links in a LAG, although some devices will restrict you to a smaller number. That said, because of the simple way most of the scheduling algorithms work, you’ll generally get better, more even load balancing if you use an even number, and preferably a power of 2 such as 2, 4, or 8.
An important concept of link aggregation is that all the packets belonging to any individual session should go down the same single link. Otherwise you risk out-of-order packets, which cause serious problems for a lot of applications.
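One way to see why powers of 2 are preferred is to assume the hash produces a small fixed number of buckets that then get dealt out to the member links. The sketch below assumes 8 buckets (a 3-bit hash result), which is one common design but not something every device does. `NUM_BUCKETS` and `bucket_share` are names I made up for the illustration.

```python
from collections import Counter

NUM_BUCKETS = 8  # assumption: the hash yields a 3-bit value


def bucket_share(num_links):
    """Count how many of the hash buckets land on each member link."""
    counts = Counter(bucket % num_links for bucket in range(NUM_BUCKETS))
    return [counts[link] for link in range(num_links)]


for num_links in range(2, 9):
    print(num_links, "links:", bucket_share(num_links))
```

Under this assumption only 2, 4, and 8 links divide the buckets evenly. With 3 links the split is 3:3:2, so one link gets less than its share of the traffic.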
Most scheduling algorithms use some sort of simple hash function that looks at fields in the Layer 2 and/or Layer 3 headers. The most common hashes involve the source and destination MAC addresses, the source and destination IP addresses, or both sets of addresses.
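As a rough illustration, a hash of this kind can be as simple as XORing the two addresses and taking the remainder by the number of links. This is a sketch of the general idea only. Actual switches use their own, often undocumented, functions, and `link_by_mac` and `link_by_ip` are hypothetical names.

```python
import ipaddress


def mac_to_int(mac):
    """Convert a MAC address like '00:00:5e:00:53:01' to an integer."""
    return int(mac.replace(":", ""), 16)


def link_by_mac(src_mac, dst_mac, num_links):
    """Pick a member link from the source and destination MAC addresses."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % num_links


def link_by_ip(src_ip, dst_ip, num_links):
    """Pick a member link from the source and destination IP addresses."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % num_links


# Every packet in a conversation carries the same addresses,
# so it always hashes to the same link and stays in order.
print(link_by_mac("00:00:5e:00:53:01", "00:00:5e:00:53:0a", 4))
print(link_by_ip("192.0.2.10", "192.0.2.20", 4))
```

Because XOR is symmetric, this particular sketch also sends both directions of a conversation to the same link index, although the two devices are free to use different algorithms.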
Many devices give you the option of selecting the appropriate load balancing algorithm for your network.
It’s important to note that two devices connected by link aggregation don’t need to agree on the load balancing algorithm, and sometimes you might not want them to. The goal is to select an algorithm that randomizes your packets as much as possible. That way you can expect fairly even use of all the links, which will provide the best possible performance.
On a general switched Layer 2 network with a lot of devices talking across the aggregated link in arbitrary patterns among themselves, the simplest MAC address hashing algorithm works well. Even if most of the traffic involves devices talking to a single central server, the algorithm still works well because the randomness of the MAC addresses of the other devices ensures reasonably even load balancing.
However, if the link is basically just two devices talking directly to one another across an aggregated link, then a MAC-based load balancing algorithm means all the traffic uses just one of the links.
This is the case, for example, if you have two routers (or Layer 3 switches) or two firewalls, or one of each talking across the link. You might be communicating with the whole Internet, but if all the packets are going to the same firewall, that’s one MAC address. And if all the packets are coming from one core switch, that’s also one MAC address. So a hash based only on MAC addresses won’t give you any performance advantage in such cases. In situations like this, it’s useful to use IP addresses in the load balancing algorithm.
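Here is a small Python sketch of that situation, using a toy XOR-and-modulo hash as a stand-in for whatever the device really does. The addresses come from the ranges reserved for documentation, and the eight inside hosts all talking to one outside server are an invented traffic pattern.

```python
import ipaddress
from collections import Counter

NUM_LINKS = 4


def mac_key(flow):
    """Hash key built from the two MAC addresses only."""
    src = int(flow["src_mac"].replace(":", ""), 16)
    dst = int(flow["dst_mac"].replace(":", ""), 16)
    return src ^ dst


def ip_key(flow):
    """Hash key built from the two IP addresses only."""
    src = int(ipaddress.ip_address(flow["src_ip"]))
    dst = int(ipaddress.ip_address(flow["dst_ip"]))
    return src ^ dst


def spread(flows, key):
    """Count how many flows land on each member link for a given hash key."""
    return Counter(key(flow) % NUM_LINKS for flow in flows)


# Every frame crosses the LAG from the core switch's MAC to the firewall's MAC,
# whatever the IP addresses inside the packets are.
flows = [
    {"src_mac": "00:00:5e:00:53:01", "dst_mac": "00:00:5e:00:53:02",
     "src_ip": f"10.1.1.{host}", "dst_ip": "198.51.100.7"}
    for host in range(1, 9)
]

mac_spread = spread(flows, mac_key)
ip_spread = spread(flows, ip_key)
print("MAC hash:", dict(mac_spread))
print("IP hash: ", dict(ip_spread))
```

With the MAC-based key all eight flows pile onto a single link, while the IP-based key puts two flows on each of the four links.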
On Cisco switches, depending on the software version, the command will be some variation of “port-channel load-balance <algorithm>”. Hitting a question mark where I have put the word <algorithm> prompts the switch to give you a list of available options.
You can easily tell whether your load balancing algorithm is appropriate by looking at the link utilizations on each of the individual links in a bundle using the “show interface” command.
If you see that one link is consistently more heavily utilized, then it might be a good idea to change your algorithm.
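If you collect the utilization figures yourself, a quick check like the following can flag a lopsided bundle. It is only a sketch: the interface names, the sample percentages, and the 1.5 threshold are all made up, and you would want to look at averages over a reasonable period rather than a single reading.

```python
def busiest_link_ratio(utilization):
    """Return the busiest link's load divided by the average load of the bundle."""
    loads = list(utilization.values())
    mean = sum(loads) / len(loads)
    if mean == 0:
        return 0.0
    return max(loads) / mean


def looks_unbalanced(utilization, threshold=1.5):
    """True if one link carries much more than its fair share (threshold is arbitrary)."""
    return busiest_link_ratio(utilization) > threshold


# Hypothetical output-utilization percentages for a four-link bundle.
lopsided = {"Gi1/0/1": 82.0, "Gi1/0/2": 9.0, "Gi1/0/3": 7.0, "Gi1/0/4": 6.0}
healthy = {"Gi1/0/1": 30.0, "Gi1/0/2": 28.0, "Gi1/0/3": 26.0, "Gi1/0/4": 27.0}

print(looks_unbalanced(lopsided))
print(looks_unbalanced(healthy))
```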
Note that when you change the algorithm on a device, you only change how that device behaves when sending packets. If the traffic imbalance is in the inbound direction, representing received packets, then you need to adjust the device on the other end.
Link Aggregation Control Protocol
Most of the time you’ll use the control protocol defined in 802.3ad and 802.1AX, called Link Aggregation Control Protocol or LACP.
There are also various proprietary link aggregation protocols. Before the standardization of LACP, Cisco developed an option called Port Aggregation Protocol (PAgP) on some Cisco switches. Other vendors have similar pre-standard protocols.
PAgP is a proprietary protocol with no significant advantages over standard LACP. It really shouldn’t be used unless you happen to be connecting to a very old Cisco device that doesn’t support LACP.
The big question you’ll have to answer when configuring a port-channel is whether to configure it to be active (or, equivalently, LACP on some devices) or merely on.
The active option means the device will actively monitor the state of the link and automatically remove any failed links from the bundle. This is obviously a very good idea because it gives you fault tolerance as well as load sharing. So why would anybody ever opt not to use it?
The short answer is compatibility. If the device on one side of the link decides one of the individual connections is bad, then the device on the other end should really agree. Otherwise one device will keep dumping packets down a link that the other device isn’t watching.
A lot of server implementations don’t seem to implement the standard properly, or they cut corners and don’t implement the active monitoring function at all. I usually make everything active and only change to on if I run into trouble with it.
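The difference between the two modes can be modeled in a few lines of Python. This is a deliberately simplified sketch: the `Link` fields and the `members` function are my own invention, and real LACP involves timers, priorities, and negotiation states that are left out here.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    carrier_up: bool         # the physical link is up
    partner_pdus_seen: bool  # the far end is answering with LACP messages


def members(links, mode):
    """Return the names of the links a device would send traffic on."""
    if mode == "on":
        # Static bundle: any link that is physically up is used.
        return [link.name for link in links if link.carrier_up]
    if mode == "active":
        # LACP: a link is used only while the partner keeps responding.
        return [link.name for link in links
                if link.carrier_up and link.partner_pdus_seen]
    raise ValueError(f"unknown mode: {mode}")


links = [
    Link("link1", carrier_up=True, partner_pdus_seen=True),
    Link("link2", carrier_up=True, partner_pdus_seen=False),  # up, but nobody is watching
    Link("link3", carrier_up=False, partner_pdus_seen=False),
]
print("on:    ", members(links, "on"))
print("active:", members(links, "active"))
```

In this model, link2 is the troublesome case: in on mode the device keeps sending packets down it, while in active mode it drops out of the bundle.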
Multi-chassis versions of link aggregation
One of the really interesting ways of deploying an aggregated link is to connect a device to a redundant pair of central core or aggregation switches. That is, instead of being a bundle of links between two devices, it’s a bundle of connections from one device to two devices.
Such a setup requires that the twinned devices at one end of the bundle look the same. They must send the identical host identification information so that the other device believes the bundle connects to a single logical device.
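From the point of view of the attached device, the logic looks roughly like this sketch: ports are grouped by the system identifier heard from the far end, so two chassis that advertise the same identifier end up in one bundle. The port names and identifiers are invented, and real LACP also compares keys and other parameters before aggregating.

```python
from collections import defaultdict


def group_by_partner(partner_ids):
    """Group local ports by the system ID advertised by the device on the far end."""
    groups = defaultdict(list)
    for port, system_id in partner_ids.items():
        groups[system_id].append(port)
    return dict(groups)


# Two independent switches: each advertises its own ID, so no common bundle forms.
independent = {"eth0": "00:00:5e:00:53:a1", "eth1": "00:00:5e:00:53:b1"}

# A twinned pair: both chassis advertise one shared ID and look like one device.
twinned = {"eth0": "00:00:5e:00:53:ff", "eth1": "00:00:5e:00:53:ff"}

print(group_by_partner(independent))
print(group_by_partner(twinned))
```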
This immediately suggests another useful topology, of course. If we accomplish multi-chassis link aggregation by making two devices “look” like they are a single device, there’s no reason we can’t do this at both ends.
Cisco has developed two different solutions for achieving this. The older solution, called Virtual Switching System (VSS), is only applicable to a few switching platforms, notably the 6500 and 4500-X platforms.
VSS solves the problem of making two switches look like one simply by making the supervisor module in one of the switches control both physical devices. The supervisor module in the other chassis becomes a redundant backup.
The newer solution, which is available on most of the Nexus switch line, is called Virtual Port Channel (vPC).
vPC allows you to pair your switches and distribute LAGs across them. You have to create a special link between the two switches, which Cisco calls the vPC peer link, that allows them to share all of the state information about the LAG. It also allows packets received through the LAG on one switch to reach devices that happen to be connected to the other switch.
To create a vPC LAG, then, you assign ports on both switches to the same channel-group number, and use that same number as a vPC identifier. The switches then figure out that all of these ports should be part of the same LAG.
There are a couple of important limitations to vPC. The peer link itself needs to be a LAG (a port-channel, in Cisco’s terminology). And each switch can only have one such peer link to one other Nexus switch.
Some HP ProCurve switches include a similar feature called Distributed Trunking (DT). It’s important to note that DT is neither intended nor recommended for aggregated links between switches. Only use DT between servers and switches.
In all of these cases, it’s necessary to interconnect the two switches that will share one end of the bundle.
The future of link aggregation
Earlier I mentioned that LACP was originally defined and standardized in 802.3ad, which made it an Ethernet-specific protocol. That is no longer the case.
The IEEE realized that link aggregation isn’t fundamentally an Ethernet concept, so in 2008 they moved it out of the 802.3 Ethernet group to the 802.1 group of standards, initially unchanged except in name. It’s now called 802.1AX.
The 802.1AX specification is being updated. Much of the work being done is intended to standardize and extend some of the multi-chassis concepts that various vendors have developed as proprietary solutions.
I also wouldn’t be surprised to see additions like LAGs containing wireless links, or perhaps even a set of dissimilar physical link types and speeds.