It was a terrible, horrible, no good, very bad week at ServInt, the worst we’ve had in 10 years, since a fiber cut in just the wrong place took us completely offline for seven hours. That day was one of the most professionally terrifying of my life, but we learned from it and we grew. In the wake of that event we added redundancies far beyond the “industry standard,” we fixed a ton of processes and we quickly regained the faith of our customers. To this day, that date in 2004 remains the last time ServInt’s entire network went down.
Every time we experience problems I am determined to make sure we learn from them. This week is a big learning week, because it’s been fraught with some of the biggest problems we’ve seen in nearly a decade. Let me take a few moments to tell you a bit about what challenges this week brought — to show you what went wrong, and what we did right as we resolved them.
This week started with an announcement that the largest kernel-level exploit in the history of our VPS and virtual dedicated offerings had been discovered. This exploit could have allowed hackers to access not only our customers’ VPSs but also the machines that hosted them. Fixing it required rebooting literally thousands of servers while minimizing the impact on our clients’ businesses, which is always our top priority. Within 48 hours we performed emergency maintenance for nearly every customer in our datacenter. This meant forcing every one of them to accept at least a little downtime in the pursuit of vital security protections. Some customers did not like this, but if I had to do it all over again I would do it the same way. I am proud of the way ServInt rose to the challenge and so swiftly protected our customer base from this dangerous exploit.
I was really hoping that the week would get easier from there — but it didn’t. Last night, one of ServInt’s datacenters experienced one of the strangest, most difficult to explain, and most difficult to solve networking problems we have ever seen.
We build our networks to withstand most anything. We have stayed up through hurricanes, ice storms, and more equipment failures than I can count. We’ve made it through power disruption for extended periods, and other horrendous events that would have taken down providers that aren’t as thorough, many times over. But this one got us good for a while.
On Saturday evening, our network was running smoothly, as it generally has for more than a decade. Suddenly our monitoring system started showing red/green/red/green/etc. The phrase “this is not a drill” had to be used as senior engineers were plucked from their lives and rushed into the datacenter. Our COO was on a plane, I was at dinner, but the engineering fix-it team that really needed to be there was there, immediately. What made this situation unique, and what made it impossible to fix in the normal few minutes, was the fact that the critical equipment that was in the process of failing seemed incapable of making up its mind whether it was healthy or not. Making matters more challenging, high levels of equipment redundancy (normally a very good thing) made it nearly impossible to determine where the problem lay. Our top engineers, without access to reliable diagnostic data, literally had to pull the network apart and put it back together to find the exact piece of hardware that had gone haywire (in this case a router) and was causing everything else to behave erratically. In the meantime, there was simply no information to share with increasingly frustrated customers, and our Tweets and Facebook posts began to sound unnecessarily vague.
In a typical router-failure situation, as soon as the router shows “red/down” on our monitoring system, we post something like: “We had a failed router interrupt traffic and impact the network. This is being fixed and we’re routing around it. We’re sorry for the inconvenience.” Those are facts and details, things people can get confidence from. However, with no reliable detail to pass on, our team was left to pass on rather vague updates for quite some time. It was frustrating and made us seem much worse at communication than we actually are.
In the end, last night’s events pointed out some of ServInt’s greatest historical strengths — and some newly discovered weaknesses. We’re still the best in the business at running a reliable, robust network and data center — and, when necessary, finding and fixing complex technical problems. When it comes to customer support and communication through a crisis, however, we need to do better. Having no support/communication failover systems, and forcing ServInt and its customers to rely on Twitter and Facebook to communicate, was totally unacceptable. We will build greater redundancy into our ticketing and communication systems to make sure that never happens again.
Having said that, we can’t promise that technical glitches will never happen again. They are a fact of internet life. What matters most is that we must always — always — learn from these thankfully rare events, and become a better service provider as a result. I promise you we will do so in this case as well. You’ll see the results of this growth as the weeks and months unfold. I am confident you’ll like what you see. Thank you, as always, for your continued faith and trust in us.
There’s an interesting parallel between the way people buy web hosting and the way they buy sports cars. Frequently, the sports car purchaser who doesn’t actually compete in races will buy their vehicle based on theoretical maximum performance capability, examining numbers like top speed, maximum horsepower and so forth to see how fast their dream car might theoretically go.
Of course, people who actually race for a living understand a critically important maxim: top speeds don’t win races, high average speeds do. That means it’s just as important to be able to speed around accidents and slow traffic as it is to power down the straightaways as fast as possible.
It’s the same with hosting. The size of a CPU, the amount of RAM, the network uplink speed: these are all important metrics, but everybody’s working with similar engines these days. You can get your specs and still never see reliable performance at another host, because your server can’t swerve around the accidents and slower traffic without getting bogged down. Why? Because of something called IOPS.
If you’ve managed online applications or websites for any length of time, you’ve almost certainly dealt with hardware failures. VPS technology mitigates some of the more common types of failures, and Cloud has mitigated others. But the fact remains, hardware failures — failures of the machines housing and crunching your data — can still happen at any time.
There are many hardware and software solutions to limit the damage from hardware failures. RAID arrays, hot-swappable drives, dual power supplies, multi-core computers, and multi-stick RAM all introduce redundancy into the hardware, while backup solutions, load balancing and CDNs introduce redundancy into the data.
Most hosted content, however, whether it’s hosted on a dedicated server, VPS server or “in the Cloud,” still exists on one single physical computer. So if there is a catastrophic failure of that computer, your site goes away until the data can be recovered and rewritten to the drives on a new computer.
There was a time in hosting’s distant past when virtualization and Cloud were foreign words. Back then, the idea that you could put multiple customers on a single host machine and give them all fully partitioned and secure “virtual environments” — environments that looked and acted exactly like a small dedicated server — was novel, if not literally unbelievable. Most people who wanted to host a website simply assumed they had to build or rent a physical server in a room somewhere.
Oh, how things have changed. Now, actual physical infrastructure has become conceptually divorced from the idea of a “web server.” Want to host a web site? These days, you buy amorphous cloudy things like “instances” and “environments,” which you scale up or down as your site requires, nearly instantaneously. Costs are down, speed-to-deployment is way up, and it’s all pretty miraculous. But our eagerness to forget what a pain in the neck it is to actually own and manage a real, live server has also made us forget what we sacrificed to get scalability, redundancy, flexibility, and all the other benefits of virtualization.
The big tradeoff, the “con” against which all the “pros” of cloud must be weighed, is the fact that, no matter how you slice it up and partition it, shared infrastructure is just that: shared, usually by many.
Last week we talked about the dangers of generalizing about website and app requirements when picking a cloud service provider. Here’s the big question we’re going to try to answer this week:
Is it even possible to compare prices between cloud hosting options?
An increasing number of large cloud service providers have been trying to address the problem of explaining just what their services cost by producing cost calculators like Amazon’s. There are a few problems with these calculators.
Last week, a good friend who works at Google sent me a link to a Wall Street Journal story on the price wars that seem to be heating up in the cloud computing and storage sectors. (Editor’s note: WSJ hyperlinks only work once. To read this article, run a Google search for “A Price War Erupts in Cloud Services.”)
I found the article fascinating, but I thought it did a surprisingly poor job helping the reader understand how the Cloud might affect real-world hosting decisions.
At the center of the problem was the effort the author made to demystify the cost of cloud hosting. In order to provide a common storage and processing task against which all the major cloud service providers’ fees would be measured, the author chose the following:
“(Hosting) a medium-sized website with about 50 million page views a month…”
In January, ServInt launched our cutting-edge SolidFire SSD VPS cloud storage platform. It is simply the fastest, most highly scalable, and most reliable turn-key hosting solution on the market today.
Almost since the day we launched the SolidFire SSD VPS, our customers have been asking when they’d be able to buy a dedicated server with SolidFire SSD cloud storage.
That day has arrived!
You can now order a Flex Dedicated server with either onboard SSD or SolidFire SSD cloud storage. Both options offer the speed of an all-SSD storage array, but our SolidFire SSD cloud storage gives you additional advantages.
“This weakness allows stealing the information protected, under normal conditions, by the SSL/TLS encryption used to secure the Internet. SSL/TLS provides communication security and privacy over the Internet for applications such as web, email, instant messaging (IM) and some virtual private networks (VPNs).”
This vulnerability impacts OpenSSL versions 1.0.1 and 1.0.2-beta. ServInt customers may have this vulnerability if they are running CentOS 6. CentOS 4 and 5 do not ship versions impacted by the Heartbleed vulnerability.
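As a rough illustration of checking a version string against the affected range, here is a minimal sketch. The function name `is_heartbleed_affected` is hypothetical, and the range it encodes (upstream OpenSSL 1.0.1 through 1.0.1f, plus 1.0.2-beta and 1.0.2-beta1) refers to upstream releases only. Distribution packages can carry a backported fix under an unchanged version string, so treat the result as a hint and confirm against your vendor's package advisories.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the given upstream
# OpenSSL version string falls in the range affected by Heartbleed.
is_heartbleed_affected() {
    v=${1%-fips}    # drop a trailing "-fips" suffix, e.g. "1.0.1e-fips"
    case "$v" in
        1.0.1|1.0.1[a-f]|1.0.2-beta|1.0.2-beta1) return 0 ;;
        *) return 1 ;;
    esac
}

# Example strings only; on a real server the second field of the
# `openssl version` output would be the value to check.
for v in 1.0.1e-fips 1.0.1g 0.9.8e; do
    if is_heartbleed_affected "$v"; then
        echo "$v: in the affected upstream range"
    else
        echo "$v: not in the affected upstream range"
    fi
done
```

A pattern match on the version string keeps the check readable, at the cost of knowing nothing about vendor patches.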
DDoS attacks sound like something out of one of those cheese-ball 1980s “hackers break into somebody’s computer and ignite a world war three” movies — you know, the ones that feature 400 baud modems and TRS-80s with cassette drives — but “distributed denial of service” (DDoS) attacks are very real, and are a growing problem.
ServInt, like everybody else in the hosting industry, has seen an uptick in DDoS activities on its network over the last couple of years. And while DDoS hasn’t been a major problem for us, it’s something we’re working hard to stay ahead of — which is what brought it to my attention, and what got me to make the effort to understand DDoS attacks better.
What is a DDoS attack?
A DDoS attack occurs when hackers gain control of multiple computers (that’s what makes these attacks “distributed”) and force them to make some form of system resource-dependent request of a target computer or website. The volume at which these requests are made quickly overwhelms the computer that is being targeted, and eventually the site or computer ceases functioning.
This is not the place — and I am certainly not the author — to go into the specifics of how this all works. Here’s an article that does a good job summarizing the different kinds of DDoS attacks.
What’s more important to you and me is: how can all this affect ServInt customers, and what measures does ServInt take to address the problem?
My son is five years old and a digital native. For the last year or so he’s been saying that he wants to develop video games when he grows up. I’ve acknowledged his desire, telling him that he can develop video games, but that he’d need to learn how to write code and work hard. When he recently brought up his plan again I finally said, “Okay, let’s do it.”
We sat down and talked about what he wanted his game to do. His first idea was a Minecraft-type game with dinosaurs. I told him that was a good end goal, but that we needed to start with something smaller and simpler. I asked him what he wanted the goal of his game to be, and he said slaying a dragon. Then I asked him how he wanted the game to start, and he chose waking up in a cave. We then began designing our text-based adventure.
I used this opportunity to teach him the echo command. Echo is the simplest of commands: it prints back whatever text or variables you pass to it. It’s also a great command to use when writing your first script.
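A first script along those lines might look something like the sketch below. The scene text and the variable name `hero` are invented here for illustration; they are not taken from the actual game.

```shell
#!/bin/sh
# Hypothetical opening scene for a text-based adventure.
hero="brave adventurer"    # echo can print variables as well as plain text

intro() {
    echo "You wake up in a cold, dark cave."
    echo "Welcome, $hero. Your quest: slay the dragon."
}

intro
```

Saved as a file and run with `sh`, this prints the two lines in order. Changing the value of `hero` changes the second line without touching the `echo` commands.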