Seeing high CPU utilization on a network device is always an uncomfortable feeling. The harder the device is working, the more likely something is to go wrong.
Sometimes the extra utilization is expected—say, when your storage cluster is doing off-site backups. But other times, it’s not immediately obvious what it could be.
Today, we’ll go through a few common reasons for high CPU utilization on network devices and how to address them.
What’s your device telling you?
Almost everyone who’s worked with a computer for a while has had to open Task Manager to find the process that’s slowing things down. Many network devices let you do the same.
For example, on a Cisco device running IOS, you can run “show processes cpu sorted 5sec” to see a sorted list of processes consuming the highest to lowest amounts of CPU utilization over a five-second interval.
Google the top five process names together with keywords like “high cpu.” Questions you’re looking to answer:
- Is the percentage utilization you’re seeing considered normal?
- What have others done to alleviate the issue?
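If you collect that command output over SSH or paste it into a file, a short script can pull out the top process names for you. The following is a minimal Python sketch, assuming the common IOS column layout (PID, Runtime, Invoked, uSecs, 5Sec, 1Min, 5Min, TTY, Process). The sample output is illustrative rather than taken from a real device, and the exact layout can vary by platform and software version.

```python
import re

# Illustrative output only; real output varies by platform and IOS version.
SAMPLE_OUTPUT = """\
CPU utilization for five seconds: 87%/12%; one minute: 80%; five minutes: 75%
 PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
 112     4521340    18203344        248 41.20% 38.10% 35.02%   0 IP Input
  87      982112     2200341        446 22.50% 20.33% 19.87%   0 ARP Input
 201      120045      880123        136  9.10%  8.70%  8.12%   0 SNMP ENGINE
   1           0          15          0  0.00%  0.00%  0.00%   0 Chunk Manager
"""

# One process row: PID, three counters, three percentages, TTY, process name.
ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+\d+\s+\d+\s+\d+\s+"
    r"(?P<five_sec>[\d.]+)%\s+[\d.]+%\s+[\d.]+%\s+"
    r"\d+\s+(?P<name>.+?)\s*$"
)


def top_processes(output, limit=5):
    """Return (process name, 5-second CPU %) pairs, busiest first."""
    rows = []
    for line in output.splitlines():
        match = ROW.match(line)
        if match:  # header lines do not match and are skipped
            rows.append((match.group("name"), float(match.group("five_sec"))))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


# Print ready-made search phrases for the busiest processes.
for name, pct in top_processes(SAMPLE_OUTPUT, limit=3):
    print(f'"{name}" high cpu  ({pct:.1f}% over 5 sec)')
```

Sorting again inside the script means it also works on unsorted output, such as a plain “show processes cpu” capture.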
Have there been any recent configuration changes?
One of the first things to determine is whether any changes were recently made to the device’s configuration. There are a few ways you can figure this out:
- Navigate to the device’s dashboard in Auvik. Go to Documentation > Configurations. Look to see if there’s been a recent configuration backup. If there has, use Auvik’s Compare feature to see what was inserted, deleted, or modified in the device’s configuration.
- If you have the configuration backup alert enabled for the client account in question, check your email inbox or PSA for recently fired alerts against this device.
If changes were made, it’s possible they’re affecting CPU utilization. Use Auvik’s Configuration Restore feature to revert your changes. Or manually revert the changes using the device’s GUI or CLI.
Does the CPU utilization subside? If not, a configuration change likely isn’t the culprit. Let’s move on to the next possible cause.
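If you only have two saved copies of the configuration as text, you can get a similar inserted/deleted view with a few lines of Python. This is a rough sketch using the standard library’s difflib module. It is not how Auvik’s Compare feature works, and the two configuration snippets are hypothetical.

```python
import difflib

# Hypothetical configuration backups, for illustration only.
BEFORE = """\
hostname edge-sw1
interface GigabitEthernet0/1
 description Uplink
!
"""

AFTER = """\
hostname edge-sw1
interface GigabitEthernet0/1
 description Uplink
 ip access-group 101 in
!
"""


def config_changes(before, after):
    """Return (added, removed) lines between two configuration texts."""
    added, removed = [], []
    for line in difflib.ndiff(before.splitlines(), after.splitlines()):
        if line.startswith("+ "):    # line exists only in the newer config
            added.append(line[2:])
        elif line.startswith("- "):  # line exists only in the older config
            removed.append(line[2:])
    return added, removed


added, removed = config_changes(BEFORE, AFTER)
print("Added:", added)
print("Removed:", removed)
```

A modified line shows up as one removed line plus one added line, which is usually enough to spot what to revert.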
Have there been any recent Layer 1 changes?
Review your network topology. Look for alerts like Interface Status Mismatch on the switch for the last few days or weeks. Have cables been moved around?
Things to consider:
- Look for suspicious devices that look out of place. Are they pushing a lot of traffic?
- If wiring changes were made, try reverting them during an appropriate maintenance window to see if that correlates to a drop in CPU utilization.
Do you need an augmentation?
If the number of devices on the network or the amount of traffic is steadily growing, it stands to reason the device will be increasingly taxed. It could be a case of:
- An increasing amount of throughput
- Packet processing, like QoS shapers and policing
- The routing table filling up
- Ancillary services like DNS, DHCP, or access control lists. Can anything be offloaded elsewhere? For example, if your overtaxed firewall is handling DNS and DHCP for a large number of subnets, could some of that work be offloaded to a Layer 3 switch?
- Encryption-heavy processes like SNMPv3 or a large number of SSH sessions to the device
If any of these things are the culprit, you’ll eventually need to add resources to the existing platform or delegate the services to other devices or machines.
Are you seeing any bottlenecks?
Whenever packets don’t flow freely, they get queued up, and a large amount of buffering takes up CPU. A common cause of queuing is a large amount of traffic from a fast interface trying to pass over a much slower one.
A slow Internet connection that’s always maxed out will cause a lot of buffering. A recommendation to your client might be to upgrade the link. If it’s a local link, you can consider increasing the size of the pipe through link aggregation or NIC teaming.
On a Cisco IOS device, commands like “show memory” and “show buffers” can help confirm suspicions of a bottleneck.
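To check whether a link really is maxed out, you can work out its utilization from two readings of an interface octet counter. The sketch below assumes the readings come from SNMP counters such as ifInOctets (32-bit) or ifHCInOctets (64-bit). The function name and the sample numbers are made up for illustration.

```python
def link_utilization(octets_start, octets_end, interval_s, speed_bps, counter_bits=64):
    """Return average link utilization (%) between two octet counter readings.

    counter_bits is the counter width, so one counter wrap between
    readings is handled (32 for ifInOctets, 64 for ifHCInOctets).
    """
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_transferred = delta_octets * 8
    capacity_bits = interval_s * speed_bps
    return bits_transferred / capacity_bits * 100


# Hypothetical readings: a 100 Mbps link polled 300 seconds apart.
pct = link_utilization(0, 3_600_000_000, interval_s=300, speed_bps=100_000_000)
print(f"Average utilization: {pct:.1f}%")
```

A link that sits near 100 percent across many polling intervals is a good candidate for an upgrade or for link aggregation.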
Are you seeing any broadcast storms?
Bursts of broadcast traffic, such as ARP requests, can spike CPU very quickly. If you’re seeing a lot of broadcast traffic on a particular interface, for example, it may be worth doing a port mirror against the affected interface and then sniffing the traffic using a tool like Wireshark.
Depending on what you find, there are a couple of things you can try:
- Enable Ethernet storm control on your interfaces to automatically drop the excess traffic or shut down affected ports, depending on the action you configure, when broadcast traffic exceeds a threshold. This helps to prevent a network outage.
- Decrease the dynamic ARP cache, especially if you’re in an environment with a lot of device churn.
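Once you’ve captured traffic from the mirrored port, it helps to know which hosts are sending the broadcasts. The sketch below assumes you’ve exported the capture as a list of (source MAC, destination MAC) pairs. The function name and the sample frames are hypothetical.

```python
from collections import Counter

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def top_broadcast_talkers(frames, limit=3):
    """Return (source MAC, broadcast frame count) pairs, noisiest first."""
    counts = Counter(
        src.lower() for src, dst in frames if dst.lower() == BROADCAST_MAC
    )
    return counts.most_common(limit)


# Hypothetical frames exported from a packet capture: (source, destination).
frames = [
    ("00:11:22:33:44:55", "ff:ff:ff:ff:ff:ff"),
    ("00:11:22:33:44:55", "FF:FF:FF:FF:FF:FF"),
    ("00:11:22:33:44:55", "ff:ff:ff:ff:ff:ff"),
    ("66:77:88:99:aa:bb", "ff:ff:ff:ff:ff:ff"),
    ("66:77:88:99:aa:bb", "00:11:22:33:44:55"),  # unicast, ignored
]

for mac, count in top_broadcast_talkers(frames):
    print(f"{mac} sent {count} broadcast frames")
```

A single source that dominates the list is a good place to start looking for a misbehaving device or a loop.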
Have there been any recent spanning tree changes?
The Spanning Tree Protocol prevents Layer 2 loops.
Generally, spanning tree is a process that runs in software and doesn’t take advantage of hardware offload. The more VLANs and active interfaces provisioned on a switch, the more CPU time is required to reconcile and reroute traffic when topology changes occur.
If you’re seeing a large number of spanning tree changes in the network, it’s probably making your CPU work harder.
Enable CPU offloading
Some devices provide the option to perform certain functions through software (using the device’s CPU) or on hardware (using ASIC chips). Wherever possible, offload tasks from the CPU onto other dedicated hardware.
Engage your vendor’s support team
If you’ve completed an investigation and still don’t know the cause of high CPU utilization, the device’s support team might be able to shed some light. Sometimes there are software bugs or a very specific device configuration that causes something to run away. They’ll work to reproduce the issue themselves and identify the root cause and a solution.