In the last few years I’ve helped grow the networks — including the processes surrounding them — in two companies facing explosive growth. To understate matters, it’s not easy. But in hindsight the technological angle wasn’t the hardest part.
In both cases, I joined a company with fewer than 150 technical people. I say “technical” because it’s a relevant distinction. And I highlight the number 150 because if you’ve read Malcolm Gladwell’s bestseller, The Tipping Point, you recognize it as Dunbar’s number: the rough maximum number of people you can keep straight in your head.
As you approach this magic number it becomes increasingly difficult to keep tabs on who to talk to regarding what, and eventually you lose all track. This describes the crescendo I walked into in both companies. And though I experienced the pain of migrating from personal relationships to systems, I can practically guarantee you that postponing or avoiding each migration would have made things worse.
A primary example is a ticketing system. In late 2005, I arrived at Myspace as the second network engineer at a site that was rapidly becoming the most popular on the Internet. Our ticketing system consisted of:
- An open floor plan.
- Knowing your coworkers by name.
- Text files, emails, and the occasional sticky note.
The people there weren’t stupid — they were overwhelmed by success. The transition to a system, a scalable solution that didn’t rely on “tribal” knowledge, consisted of a multi-pronged attack. None of the prongs were fun. We did them because we were actively hurting. Donning our big-kid pants meant the following changes:
Implementing a ticketing system
When your friend walks up to you and requests a change, you have to tell them to put it in a ticket. Yes, it’s a very irritating thing to say to someone. There’s a distinct chance they’ll be miffed, thinking you’re blowing them off. You’re not, and you need to assure them of that: you’ll even start prepping the change before they enter the ticket … but THEY NEED TO PUT IN THE TICKET.
Beyond cutting down on interruptions from people walking up behind you and impatiently shifting their weight back and forth until you address them, the tickets are there to prove to management you need more help. Tickets pack a few other bonuses, like seeing who keeps requiring “emergency” level help because they can’t plan a day ahead — but you might want to keep that part to yourself.
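As a rough illustration of how tickets turn into evidence, the sketch below tallies a made-up ticket export by requester and priority. The field names (`requester`, `priority`) and the `"emergency"` label are assumptions for the example, not any particular ticketing product’s schema.

```python
from collections import Counter

# Hypothetical ticket export; the field names are assumptions for illustration.
tickets = [
    {"requester": "alice", "priority": "emergency"},
    {"requester": "bob", "priority": "normal"},
    {"requester": "alice", "priority": "emergency"},
    {"requester": "carol", "priority": "normal"},
    {"requester": "alice", "priority": "normal"},
]


def emergency_counts(tickets):
    """Count emergency-priority tickets per requester."""
    return Counter(
        t["requester"] for t in tickets if t["priority"] == "emergency"
    )


def tickets_per_engineer(tickets, engineers):
    """Average ticket load per engineer: the number to show management."""
    return len(tickets) / engineers


print(emergency_counts(tickets).most_common())
print(tickets_per_engineer(tickets, engineers=2))
```

Neither number is possible to produce from sticky notes and hallway requests, which is the point of insisting on the ticket.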
Holding meetings to share information
You have to have meetings. I’m honestly sorry to have written that. Not many meetings, not long, and certainly not gratuitous, but you need to tell people what’s going on.
Have you ever found out from a server administrator’s request that you’re opening a new site? I have. As you get more coworkers you can’t rely on them hearing you yell something to your friend, especially since rapidly growing companies tend to run out of floor space and can’t put everyone together logically.
Standardizing all the things
Standardization has to happen across the spectrum, from the high level (do we use BGP to every ISP, even in a single-homed site, or just use statics?), down to templating common changes.
This is actually a pretty easy step if you can spare people long enough to sit down and discuss matters. Since we’re talking about small businesses that need to “turn the corner,” this usually means coordinating buy-in from under five people.
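Templating a common change can be as small as the sketch below, which renders one hypothetical change (moving an access port to a VLAN) from a fixed template. The command syntax is Cisco IOS-style and is only an assumption about the gear in use; the function and template names are invented for the example.

```python
from string import Template

# Hypothetical template for one common change: moving an access port to a VLAN.
# The IOS-style syntax is an assumption; adapt it to whatever your gear expects.
ACCESS_PORT = Template(
    "interface $interface\n"
    " description $description\n"
    " switchport mode access\n"
    " switchport access vlan $vlan\n"
)


def render_access_port(interface, description, vlan):
    """Render the change, rejecting VLAN IDs outside the 1-4094 range."""
    if not 1 <= vlan <= 4094:
        raise ValueError(f"invalid VLAN ID: {vlan}")
    return ACCESS_PORT.substitute(
        interface=interface, description=description, vlan=vlan
    )


print(render_access_port("GigabitEthernet0/1", "web01 eth0", 120))
```

Once the template is agreed on, every engineer produces the same four lines for the same request, and a reviewer only has to check the three values that vary.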
Unfortunately, standardization includes deciding on a naming convention. I hate naming convention discussions. They tend to drag on far too long and elicit input from people who duck real technical work and just want to be seen doing something.
That said, naming your firewall Smaug is cute, but it’s not very helpful when you have eight firewalls across four sites and one dies. Again, this is where things stop being fun and personal, and start being systematized and predictable.
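To make the contrast with Smaug concrete, here is a minimal sketch of a naming convention that a script (or a tired engineer) can decode. The scheme itself, `<site>-<role><two-digit index>`, and the site and role codes are made up for illustration; the point is only that the name carries the location and function.

```python
import re

# Hypothetical convention: <site>-<role><two-digit index>, e.g. "lax-fw01".
# The role codes below are assumptions for the example.
ROLES = {"fw": "firewall", "rtr": "router", "sw": "switch"}
NAME_RE = re.compile(r"^(?P<site>[a-z]{3})-(?P<role>[a-z]+)(?P<index>\d{2})$")


def parse_name(hostname):
    """Return the site, role, and index encoded in a hostname, or None."""
    match = NAME_RE.match(hostname)
    if match is None or match["role"] not in ROLES:
        return None
    return {
        "site": match["site"],
        "role": ROLES[match["role"]],
        "index": int(match["index"]),
    }


def make_name(site, role, index):
    """Build a hostname and confirm it parses under the same convention."""
    name = f"{site}-{role}{index:02d}"
    if parse_name(name) is None:
        raise ValueError(f"does not fit the convention: {name}")
    return name


print(make_name("lax", "fw", 1))
print(parse_name("lax-fw01"))
print(parse_name("smaug"))
```

When one of eight firewalls dies, a name that parses tells you which site to call before you open any documentation.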
In a small IT shop people are often encouraged to solve their own problems. That’s great when you have fewer than 20 geeks, especially at a tech startup where they’re likely to be of the high-octane variety. But eventually you end up not being able to log into a problematic piece of gear with one of the seven+ “standard” logins, and there’s no documentation telling you who configured it, how, or even why. And since they named it Smaug you can’t guess either.
The maxim that follows, “no one-offs,” extends to anything you plan on doing repeatedly. If you need to check more than once a month how much bandwidth specific links are pushing, set up a graphing box, assign it an official babysitter, and tell everyone they can stop doing “show interfaces” on routers.
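For a sense of what a graphing box does on everyone’s behalf, the sketch below converts two samples of an interface octet counter into an average rate. It assumes the samples come from periodic polling of a counter such as SNMP’s 64-bit `ifHCInOctets` (or the 32-bit `ifInOctets`); the function name and parameters are invented for the example, and real graphing tools handle far more edge cases.

```python
def bits_per_second(prev_octets, curr_octets, interval_seconds, counter_bits=64):
    """Convert two octet-counter samples into an average rate in bits/second."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # Assume the counter wrapped past its maximum between polls.
        # A device reboot also resets the counter and would be misread here.
        delta += 2 ** counter_bits
    return delta * 8 / interval_seconds


# 37,500,000 octets in a 5-minute poll interval.
print(bits_per_second(0, 37_500_000, 300))

# A 32-bit counter that wrapped between two 60-second polls.
print(bits_per_second(2**32 - 100, 200, 60, counter_bits=32))
```

Doing this arithmetic by hand from “show interfaces” output is exactly the kind of repeated one-off the graphing box exists to retire.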
Documenting all the things
Documentation needs to stop being an afterthought. The best solution I’ve seen for documenting is a simple wiki. Until there’s a wiki, whatever documentation exists is hiding on various laptops. You need to make it as convenient and inviting as possible for geeks to document their work, because they hate doing it.
Two wiki-related notes bear mentioning: First, if you build the layout of the wiki and agree where things should be stored, it drastically cuts confusion. If you let everyone dump random notes or documents on the front page, within a year or two it’s useless because no one can find anything. Second, if it’s only a wiki when you use a specific web browser, it’s not a wiki. Dictating people’s choices down to the browser level annoys them. Do that sort of thing enough and the quality people leave.
Rotating on-call duties
You need an official on-call rotation. There’s a window of time where the “main” network engineer just sort of handles after-hours problems, but this scales into a mess. At a certain point there are too many calls going to one person, and since that person isn’t officially on call, a few of those calls will probably start going unanswered.
The best results I’ve seen have come from having a weekly primary engineer and a backup, with everyone else allowed to go out and have a life. At one job, we subscribed to a really cheap service that let us use the same two toll-free phone numbers for primary and backup. Whoever started on call that week updated the forwarding rules to land on their phone. Until we did that there was always at least a little confusion as to who to call in an emergency, which is exactly when you need clarity.
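A weekly primary-and-backup rotation is simple enough to compute rather than remember. The sketch below is one way to do it, assuming a fixed roster and a known start date; the names, the start date, and the choice to make this week’s backup next week’s primary are all assumptions for the example.

```python
from datetime import date

# Hypothetical roster; list order determines the rotation order.
ENGINEERS = ["ana", "ben", "chloe", "dev"]
ROTATION_START = date(2024, 1, 1)  # a Monday, treated as the start of week 0


def on_call(day, engineers=ENGINEERS, start=ROTATION_START):
    """Return (primary, backup) for the week containing the given day."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and backup")
    week = (day - start).days // 7
    primary = engineers[week % len(engineers)]
    backup = engineers[(week + 1) % len(engineers)]
    return primary, backup


print(on_call(date(2024, 1, 1)))
print(on_call(date(2024, 1, 10)))
print(on_call(date(2024, 1, 22)))
```

Rotating the backup into the primary slot means whoever takes over has already spent a week hearing about the current problems.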
Introducing change control
Finally, you need change control. In a small shop moving at warp 8, everyone is used to getting things done with no red tape or waiting, but when you’re running as fast as possible you tend to trip on your own feet a lot. A lack of change control contributes to outages and network entropy as people fling commands around to “make it work,” and it’s also extraordinarily stressful for the network engineers.
Over the years as I’ve attended various classes, I’ve made a point to ask colleagues what sort of change control process they use. The overwhelming reply was either, “… What?” or, “My manager approves the abstract.” This way lies madness.
The most effective change control I’ve ever seen consisted of one person writing up a change, and a peer reading it over line by line, taking as much care and time as the author. If the change was applied and caused a problem, both people were on the hook.
This process slowed changes down, but it took a lot of pressure off the change’s author, and the bulk of our mistakes all but disappeared. Ultimately, it translated into better uptime and fewer awkward conversations explaining ourselves to annoyed managers.
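The peer-review rule described above can be modeled in a few lines. This is only a sketch of the idea, not a tool we actually ran: the `Change` class and its method names are hypothetical. It encodes three things: the author cannot review their own change, every line must be read by the reviewer, and once the change is ready both people are on the hook.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Change:
    author: str
    lines: List[str]
    reviewer: Optional[str] = None
    reviewed_lines: Set[int] = field(default_factory=set)

    def review_line(self, reviewer, line_number):
        """Record that the reviewer has read one line of the change."""
        if reviewer == self.author:
            raise ValueError("the author cannot review their own change")
        if self.reviewer not in (None, reviewer):
            raise ValueError("this change already has a different reviewer")
        if not 0 <= line_number < len(self.lines):
            raise IndexError("no such line in this change")
        self.reviewer = reviewer
        self.reviewed_lines.add(line_number)

    def ready_to_apply(self):
        """True only when a peer has read every line."""
        return (
            self.reviewer is not None
            and len(self.reviewed_lines) == len(self.lines)
        )

    def accountable(self):
        """Both people are on the hook once the change is fully reviewed."""
        if self.ready_to_apply():
            return [self.author, self.reviewer]
        return [self.author]


change = Change("ana", ["interface Gi0/1", " shutdown"])
change.review_line("ben", 0)
print(change.ready_to_apply())
change.review_line("ben", 1)
print(change.ready_to_apply(), change.accountable())
```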
A philosopher named Alan Watts once described the back of a tapestry as chaotic, but only at a certain magnification. If you magnify what you’re looking at, you see a pattern. Magnify again, more chaos. Again, more symmetry. If you can focus your mind’s eye on different angles and levels of your network, seeing the chaos that requires attention becomes progressively easier.