Learning From Mistakes, Growing Through Crisis
It was a terrible, horrible, no good, very bad week at ServInt – the worst we’ve had since a fiber cut in just the wrong place took us completely offline for seven hours ten years ago. That day was one of the most professionally terrifying of my life, but we learned from it and we grew. In the wake of that event we added redundancies far beyond the “industry standard,” we fixed a ton of processes and we quickly regained the faith of our customers. To this day, that 2004 outage remains the last time ServInt’s entire network has gone down.
Every time we experience problems I am determined to make sure we learn from them. This week is a big learning week, because it’s been fraught with some of the biggest problems we’ve seen in nearly a decade. Let me take a few moments to tell you a bit about what challenges this week brought — to show you what went wrong, and what we did right as we resolved them.
This week started with the announcement that the largest kernel-level exploit in the history of our VPS and virtual dedicated offerings had been discovered. This exploit could have allowed hackers to access not only our customers’ VPSs but also the machines they were hosted on. A fix required rebooting literally thousands of servers while minimizing the impact on our clients’ businesses — always our top priority. Within 48 hours we performed emergency maintenance on nearly every single customer server in our datacenter. This meant forcing every customer to accept at least a little downtime in the pursuit of vital security protections. Some customers did not like this, but if I had to do it all over again I would do it the same way. I am proud of the way ServInt rose to the challenge and protected our customer base from this dangerous exploit so swiftly.
I was really hoping that the week would get easier from there — but it didn’t. Last night, one of ServInt’s datacenters experienced one of the strangest, most difficult to explain, and most difficult to solve networking problems we have ever seen.
We build our networks to withstand almost anything. We have stayed up through hurricanes, ice storms, and more equipment failures than I can count. We’ve made it through extended power disruptions and other horrendous events that would have taken down less thorough providers many times over. But this one got us good for a while.
On Saturday evening, our network was running smoothly, as it generally has for more than a decade. Suddenly our monitoring system started showing red/green/red/green/etc. The phrase “this is not a drill” had to be used as senior engineers were plucked from their lives and rushed into the datacenter. Our COO was on a plane and I was at dinner, but the engineering fix-it team that really needed to be there was there, immediately. What made this situation unique, and impossible to fix in the normal few minutes, was that the critical equipment in the process of failing seemed incapable of making up its mind whether it was healthy or not. Making matters more challenging: high levels of equipment redundancy (normally a very good thing) made it nearly impossible to determine where the problem lay. Our top engineers, without access to reliable diagnostic data, literally had to pull the network apart and put it back together to find the exact piece of hardware (in this case a router) that had gone haywire and caused everything else to behave erratically. In the meantime, there was simply no information to share with increasingly frustrated customers, and our Tweets and Facebook posts began to sound unnecessarily vague.
In a typical router-failure situation, as soon as the router shows “red/down” on our monitoring system, we post something like: “A failed router interrupted traffic on the network. This is being fixed and we’re routing around it — we’re sorry for the inconvenience.” Those are facts and details, things people can take confidence from. With no reliable detail to pass on, however, our team was left to post rather vague updates for quite some time. It was frustrating, and it made us seem far worse at communication than we actually are.
In the end, last night’s events pointed out some of ServInt’s greatest historical strengths — and some newly discovered weaknesses. We’re still the best in the business at running a reliable, robust network and data center — and, when necessary, finding and fixing complex technical problems. When it comes to customer support and communication through a crisis, however, we need to do better. Having no support/communication failover systems, and forcing ServInt and its customers to rely on Twitter and Facebook to communicate, was totally unacceptable. We will build greater redundancy into our ticketing and communication systems to make sure that never happens again.
Having said that, we can’t promise that technical glitches will never happen again. They are a fact of internet life. What matters most is that we must always — always — learn from these thankfully rare events, and become a better service provider as a result. I promise you we will do so in this case as well. You’ll see the results of this growth as the weeks and months unfold. I am confident you’ll like what you see. Thank you, as always, for your continued faith and trust in us.