“If it ain’t broke, don’t fix it” is a popular quote for a simple reason: Changes can lead to unexpected results. Many network engineers have learned this lesson the hard way. Stories of admins who update firmware or configurations only to have network problems begin are common. Even big names in networking have been hit by change-induced network failures.

In July 2021, Akamai made a DNS configuration update that led to service disruptions for many major websites around the world, including Amazon and Costco. The fix? Akamai had to roll back the changes. More recently, a network change at Facebook led to an hours-long outage. This isn’t pointing fingers, just stating the facts—changes to the network increase the risk of outages.


Given risks like that, you can’t blame the paranoid network engineer who doesn’t want to change or update anything. The problem is that change is necessary. Security patches, network upgrades, and even network optimization all require some level of change. While there’s a tension between stability and updates, good network engineers learn how to keep networks updated while limiting network problems with effective change management.

Here, we’ll take a closer look at change-induced downtime and some network change management best practices that can help you avoid that.

What is change-induced downtime exactly?

Change-induced downtime is any unplanned network downtime or significant degradation of service that results from a configuration or firmware/software change.

Basically, any time a change is made to a network device and service is disrupted as a result, you have change-induced downtime. A complete outage or an inaccessible service is the most obvious example, but severe performance issues resulting from changes belong in the “change-induced downtime” category too.

For example, incorrect port speeds are a textbook example of reduced performance resulting from configuration changes. Suppose you’re on a Gigabit network and accidentally set the ports on a switch to 10 Mbps or 100 Mbps. Links that were never a bottleneck suddenly become one, and users could be outright unable to use certain services. While this is a simple example, there are plenty of other configuration changes (e.g. VPN, route advertisements, DNS, and load balancers) that can impact performance.
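The port-speed scenario above can be caught programmatically. Here’s a minimal Python sketch, assuming the negotiated port speeds have already been collected (e.g. via SNMP or the device’s API); the interface names and the `EXPECTED_SPEED_MBPS` baseline are illustrative:

```python
# Hypothetical check: flag switch ports negotiated below the expected speed.
EXPECTED_SPEED_MBPS = 1000  # assumption: a Gigabit access network

def find_bottleneck_ports(port_speeds: dict[str, int]) -> list[str]:
    """Return ports whose negotiated speed is below the expected baseline."""
    return [port for port, speed in sorted(port_speeds.items())
            if speed < EXPECTED_SPEED_MBPS]

# Illustrative data: Gi1/0/2 and Gi1/0/3 were fat-fingered to lower speeds.
ports = {"Gi1/0/1": 1000, "Gi1/0/2": 100, "Gi1/0/3": 10}
print(find_bottleneck_ports(ports))  # → ['Gi1/0/2', 'Gi1/0/3']
```

Run as a post-change sanity check, a report like this turns a silent slowdown into an immediate, actionable alert.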

From a business perspective, the potential impacts of change-induced downtime include:

  • Not meeting SLAs. Often, network managers and service managers are obligated to ensure network availability meets a consistent, predefined level of service. Change-induced downtime inherently hurts your availability metrics and can cause you to fall short of SLAs.
  • Lost productivity. Modern work depends on network availability. That means change-induced downtime can bring productivity to a screeching halt.
  • Lost revenue. Time is money: If connectivity to customer-facing services goes down, change-induced downtime can directly impact revenue.

Common causes of change-induced downtime

Given the damage network failures can bring about, it’s important to understand what causes them. In general, change-induced downtimes tend to occur when an otherwise routine network maintenance task goes awry. Some of the most common causes of change-induced downtime include:

  • Human error. Most often, it’s us humans responsible for managing networks who are the cause of network downtime. We fat-finger a command. We misunderstand upgrade steps. We upload the wrong files. It happens—we’re human after all—and it’s to be expected as long as we’re manually performing maintenance tasks. That’s why automating routine procedures and using fail-safes like network configuration checklists are so important to network configuration management.
  • Firmware/software updates. As a general rule, you should apply security patches and software updates in a reasonable amount of time. From a security perspective, patching known vulnerabilities is one of the most high-impact steps you can take to harden your network. Unfortunately, sometimes vendors get it wrong. For example, some Cisco MDS Fabric Switches recently had an issue where connectivity to storage devices was lost after a routine software upgrade.
  • Network device configuration changes. Changing the configuration settings on a router, switch, firewall, or other network device is one of the most common network engineering tasks. It’s also one of the most common reasons change-induced downtime happens. Look no further than recent headlines for an example: Facebook’s engineering team attributed their six-hour outage on October 4th, 2021 to “configuration changes on the backbone routers” in their network. It’s estimated that the outage cost Facebook (excluding revenue from Instagram and WhatsApp) anywhere from $28 million to $40 million USD in lost revenue.

(Editor’s note: As we were finalizing this article, streaming platform Twitch announced that it had experienced a massive breach and theft of its source code. They now believe that hackers were able to exploit a server configuration update.)

  • Installing new hardware. Normally, you think of existing hardware and software when you think about change management. However, it’s important to remember that networks are complex systems with many interdependencies between devices. Using a personal anecdote as an example: a technician I worked with once deployed an adaptor that added two network interfaces to a device. Or so he thought. The adaptor actually functioned as a two-port switch. Once he connected both Ethernet ports to the same switch, we had ourselves a network loop that caused network problems and service disruptions across the network. Another fun day in the world of networking!

How to prevent change-induced downtime

Now that we’ve established the common causes of change-induced downtime (it’s basically us), the obvious next question is: how do you prevent it?

It’s a multi-part approach: preparation, effective change management, reducing opportunities for human error, testing, and a reliable backup/restore strategy.

Preparing for change

“An ounce of prevention is worth a pound of cure” applies to firefighting of the network sort, too. Here are four important steps you can take before you implement a change.

1. Use the 7 Rs of change management. If you’re updating network hardware, firmware, or software, there should be a good reason. After all, no matter how much testing and planning you do, there’s always some risk involved in making a change. ITIL’s 7 Rs of change management provide a solid framework for determining whether a change is worth it, and setting you up for success once you decide to move forward:

  • Who Raised the change request?
  • What’s the Reason for the change?
  • What are the Risks involved in the change?
  • What Return (benefit) is expected from the change?
  • Who is Responsible for implementing and testing the change?
  • What Resources do you need to deliver the change?
  • What is the Relationship between this change and other changes?

Answering these questions will help you ensure unnecessarily risky or impractical changes don’t see the light of day.
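To make the 7 Rs more than a mental exercise, you can encode them as a change-request template that refuses to move forward until every question is answered. This is a lightweight sketch, not part of ITIL itself; the field names loosely map to the 7 Rs (including Risks, which is part of the standard ITIL list), and the completeness check is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    """Hypothetical change-request record capturing the 7 Rs up front."""
    raised_by: str          # who Raised the request
    reason: str             # the Reason for the change
    risks: str              # the Risks involved
    expected_return: str    # the Return (benefit) expected
    responsible: str        # who is Responsible for implementing and testing
    resources: list[str] = field(default_factory=list)        # Resources needed
    related_changes: list[str] = field(default_factory=list)  # Relationships

    def is_complete(self) -> bool:
        """Reviewable only when every required R has a non-empty answer."""
        return all([self.raised_by, self.reason, self.risks,
                    self.expected_return, self.responsible])
```

Gating approval on `is_complete()` is a cheap way to stop half-considered changes from reaching production.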

2. Research updates before you roll them out. You can learn a lot about an update before you even deploy it. Release notes, forums (e.g. Spiceworks, r/sysadmin, community.cisco.com, etc.), and a simple Google or DuckDuckGo search can go a long way. Invest a little effort to make sure there are no major known issues with a release before you update. Better yet, if you have a lab environment with similar or the same equipment, run through the update in your lab.

3. Have reliable backups. Even with all your planning and testing, things can go wrong. If they do, a set of “golden configuration” backups provides a restore point you can revert to. If a network device is in your production environment and supports any sort of configuration, create a backup for it. Of course, a backup isn’t worth much if you can’t restore it, so make sure to test your restore process as well.
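The “test your restore” advice can itself be automated. Here’s a minimal Python sketch, assuming the running configuration has already been fetched as text; the file-naming scheme and checksum verification are illustrative, not any particular tool’s behavior:

```python
# Hypothetical verifiable backup: save a device's running config to a
# timestamped file, then confirm the saved copy matches the original
# byte-for-byte before trusting it as a restore point.
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def backup_config(device: str, running_config: str, backup_dir: Path) -> Path:
    """Write the config to a timestamped file and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = backup_dir / f"{device}-{stamp}.cfg"
    path.write_text(running_config)
    return path

def verify_backup(path: Path, running_config: str) -> bool:
    """A backup you can't restore is worthless -- check it actually matches."""
    saved = hashlib.sha256(path.read_bytes()).hexdigest()
    live = hashlib.sha256(running_config.encode()).hexdigest()
    return saved == live
```

In practice you’d run the verify step on a schedule, not just at backup time, so a corrupted or truncated backup is caught long before you need it.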

Automated backup is the way to go! Depending on manual intervention for your backup process is a recipe for configuration drift (if backups aren’t regularly created) and human error. Automating the process with a tool like Auvik can make your network backup process significantly more reliable, and can make it easier to restore when needed.

4. Know the current state of your network. If you don’t know the current performance baselines for your network, you can’t quickly identify performance issues that occur post-change. Proactive network performance monitoring helps you get a jump on performance degradation or outages if they do occur. Similarly, up-to-date network documentation—including dynamic network maps—help you understand the interdependencies in a network, make informed decisions on network changes, and improve observability, so you can remediate issues faster if things go wrong.
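A performance baseline doesn’t have to be sophisticated to be useful. This Python sketch flags a post-change metric that deviates sharply from pre-change samples; the latency numbers and the three-sigma threshold are illustrative assumptions:

```python
# Hypothetical baseline check: is the current reading an outlier relative
# to the samples collected before the change?
from statistics import mean, stdev

def deviates_from_baseline(samples: list[float], current: float,
                           threshold: float = 3.0) -> bool:
    """True if `current` is more than `threshold` std devs from the baseline."""
    mu, sigma = mean(samples), stdev(samples)
    return abs(current - mu) > threshold * sigma

latency_ms = [12.1, 11.8, 12.4, 12.0, 11.9]  # pre-change baseline samples
print(deviates_from_baseline(latency_ms, 48.0))  # → True: clear regression
print(deviates_from_baseline(latency_ms, 12.2))  # → False: within baseline
```

Real monitoring platforms do this continuously and with better statistics, but even a check this crude distinguishes “the change regressed performance” from “it’s always been this slow.”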

Avoid downtime and reduce risk with these steps

Once the time comes to deploy a change, these steps can help mitigate risk and get a jump on recovery in the event a crash occurs.

  1. Schedule a maintenance window. Ideally, changes should be deployed during off-peak hours or during a preplanned maintenance window. This helps minimize the potential disruption of service involved with a change. Make sure your maintenance window doesn’t run right up to normal business hours either – give yourself a bit of breathing room in case something does go wrong.
  2. Have your rollback plan ready. If a change fails, have your rollback plan ready to go. For firmware/software updates, rollback plans can vary significantly from device to device. For configuration changes, Auvik’s ability to automatically manage backups ensures that a backup is created every time a change is detected.
  3. Perform pre-testing. Production shouldn’t be the first place you test a change. If you have a lab or staging environment, test your changes there first to make sure they produce the Return you predicted in the 7 Rs analysis. Alternatively, if no lab or staging environment is available, and the same change applies to multiple devices, deploy it to a subset of those devices before a large-scale rollout.
  4. Validate the state of the network before deploying the change. Configuration drift is a real problem in networks. If a network device isn’t in the state you think it is before you deploy a change, you can experience unexpected results.
  5. Automate as much of the process as you can. This is particularly important for multi-device changes. Asking a human to do the same thing multiple times comes with the risk of human error. Using automated tools ensures you’re deploying the exact same changes to each device. Just make sure you’re automating a correct, well-defined task. Simply automating a broken process won’t get you very far!
  6. Perform post-change monitoring. After the change is deployed in production, monitor the specific device(s) and broader network performance. Make sure performance meets or exceeds your baselines and the change did what you expected.
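Steps 4 and 6 above can be sketched in a few lines of Python. This assumes configs are handled as plain text and a last-known-good backup is available; the checks shown are deliberately simple illustrations, not a full deployment pipeline:

```python
# Hypothetical pre- and post-change checks around a config deployment.
import hashlib

def _digest(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def config_has_drifted(running_config: str, last_backup: str) -> bool:
    """Step 4: refuse to deploy onto an unexpected starting state."""
    return _digest(running_config) != _digest(last_backup)

def change_applied(running_config: str, expected_lines: list[str]) -> bool:
    """Step 6 (in part): verify every intended line landed post-change."""
    present = {line.strip() for line in running_config.splitlines()}
    return all(line in present for line in expected_lines)
```

Wiring these checks into the deployment tooling means a drifted device aborts the rollout before anything changes, and a silently failed push is caught minutes later instead of during the next outage.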

Final thoughts: The importance of change management, monitoring, and reliable backups

You can’t control when the next zero-day will force you to patch dozens of devices in production quickly. But there are steps you can take to set yourself up for success even when you’re under time pressures. A change management strategy, incorporating best practices like the 7 Rs, helps you make sure a given change is a wise business decision. Real-time network monitoring helps you track performance before, during, and after the deployment of the change.

And, most importantly, an up-to-date backup you can quickly restore provides you with a contingency to get back to a working state fast in case things don’t go to plan.

Want to try Auvik’s automatic backup functionality for yourself? Get your free 14-day Auvik trial here.
