Wikipedia Explains Today's Global Outage
gnujoshua writes "The Wikimedia Tech Blog has a post explaining why many users were unable to reach Wikimedia sites due to DNS resolution failure. The article states, 'Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.'"
Test, and Test Again (Score:4, Insightful)
However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally.
Good thing Wikimedia pays their System Administrators well enough to test their backup systems.
Oops (Score:4, Insightful)
You see guys, this is why you regularly test your backup plans and failovers. This is equivalent to building maintenance making sure the fire extinguishers aren't expired... it's basic to IT. Unfortunately, Wikipedia just reminded us that what's basic isn't always what's remembered. Someone just lost their job.
Re:Test, and Test Again (Score:4, Insightful)
I know people who work in the Florida DC. They do, and they are smart people. Don't assume incompetence.
Re:Oops (Score:3, Insightful)
Re:Test, and Test Again (Score:3, Insightful)
Re:Test, and Test Again (Score:2, Insightful)
Free media publicity.
Re:Oops (Score:1, Insightful)
Re:Test, and Test Again (Score:3, Insightful)
True, and the cost was probably fairly minor, as they are not advertising based... so only the cost of any people so pissed off with the downtime that they refuse to donate :)
Edited? (Score:3, Insightful)
Well, looks like all the DNA jokes are now -1 off topic
Well played /., well played.
Re:Test, and Test Again (Score:1, Insightful)
I know people who work in the Florida DC. They do, and they are smart people. Don't assume incompetence.
I'm going to assume incompetence. The only question is whose incompetence: the admins, or the folks higher up the food chain who didn't give them the resources they needed. But I have no doubt somebody was incompetent somewhere, how else do you explain the failure? Can you answer that instead of telling people what to think?
Re:Oops (Score:3, Insightful)
For every hour? Really? With that logic they should just keep it down 24/7 then.
Only when combined with the premise that profit is a goal for them. Which it's not.
backup failure doesn't mean a failure to test (Score:5, Insightful)
I see lots of comments stating that this would not have happened had admins run regular tests on the failover mechanisms. That seems a poor assumption: if the system happens to fail and then an outage occurs before the next scheduled test, one may not be aware of it.
We had this problem recently where we were testing our backup generator. Normally, we cut power to the local on-campus substation, which kicks in the generator and activates a failover mechanism, rerouting power. Well, the generator came on no problem but the failover mechanism was broken, so every server in the datacenter spontaneously lost power. Had we known the failover was broken, we would not have done the regular test. However, the last test on the failover (done directly without cutting power), a mere month prior, had shown the failover mechanism was fine.
Point being, unless you are going to literally continuously test everything, there is still some probability of an unexpected double failure.
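To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes the failover breaks silently at a constant average rate (an exponential failure model) and that each scheduled test catches and fixes any breakage. The function name and the "one silent failure per year" figure are made up for illustration, not taken from Wikimedia or from our datacenter.

```python
import math


def undetected_failure_probability(failure_rate: float, test_interval: float) -> float:
    """Average chance the failover is silently broken at a random moment.

    failure_rate  -- assumed silent failures per unit time (hypothetical)
    test_interval -- time between tests, in the same unit

    After a passing test the mechanism is known good; the chance it has
    broken by time t is 1 - exp(-rate * t). Averaging that over one test
    interval gives 1 - (1 - exp(-x)) / x, with x = rate * interval.
    """
    x = failure_rate * test_interval
    if x == 0:
        return 0.0
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for small x.
    return 1.0 - (-math.expm1(-x)) / x


if __name__ == "__main__":
    rate_per_day = 1 / 365  # assumption: one silent failure per year
    for label, interval_days in [("monthly", 30), ("weekly", 7), ("daily", 1)]:
        p = undetected_failure_probability(rate_per_day, interval_days)
        print(f"{label:8s} tests: {p:.2%} chance the failover is broken when needed")
```

Under these assumptions, shortening the test interval shrinks the risk roughly in proportion, but it never reaches zero, which is the double-failure point above.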
Re:Oops (Score:2, Insightful)
Someone just lost their job.
I highly doubt someone lost their job over this - and they shouldn't. There are no perfect systems out there, period. Given Wikipedia is a not-for-profit corporation, they very likely have limited resources and the IT staff does the best with what they have. Even with a virtually unlimited amount of resources things can still go wrong in a "Perfect Storm".
If anything, the System Administrators should be commended for their quick actions to get the site back up and running as soon as they did.
Re:Distributed Wikipedia (Score:5, Insightful)
It's hard enough keeping a bunch of nodes that you control online and functioning properly (hence the failure) ... trying to run anything reliable when you give any control you had to other random people on the Internet is doomed to fail.
The only reason distributed computing projects like SETI@HOME and distributed.net work is because the server gives clients data to process but it doesn't need a quick response, nor does it have to trust that the data returned is actually valid ... it's going to have another host check it at some point anyway to be sure. Those clients are used to weight the data so the master server only processes the most likely packets that may match and need authoritative checking.
Doing that for a web server would be ... well, a complete and total waste of resources as it's likely to be worse in every single way, including reliability.
Re:Test, and Test Again (Score:3, Insightful)
Wow. For someone who probably uses the service and doesn't pay for it, you're sure griping a lot.
0. For someone who is going off on a rant based on a reasoned assumption, you sure aren't setting off on the right foot by starting with the unjustified assumption that the poster uses Wikipedia;
1. You don't have to pay for or be a net consumer of something in order to criticise it - all you have to do is provide a reasonable explanation for the criticism. The alternative, that only the paying consumer should have a voice, is irrational and harmful;
2. All this said, maybe the poster has donated time and/or money to Wikipedia - you do realise it's produced by thousands of (sometimes even well-meaning) volunteers, right?
They don't serve ads (well, except to solicit funds to keep their servers up and running),
So they don't, except when they do. At least regular adverts give you the opportunity to learn about some product. Huge banners telling you that the Child in Africa will die from not knowing all the Pokemon characters if you don't donate are quite pathetic.
When you pay for an SLA with Wikipedia (signed by someone with the authority to make such an agreement) then you have the right to throw rude accusations around.
I know America has such a macho culture that it's considered life-destroying to receive public criticism, but it's actually useful to be told that you're incompetent when you're incompetent. It's the first step to finding out where you've demonstrated incompetence, which is the precursor to fixing (i) your approach; (ii) the problem. Brushing the truth under the carpet by sounding the "I/Wikipedia admins have the right not to be offended!" klaxon solves nothing.