Wikipedia Explains Today's Global Outage

gnujoshua writes "The Wikimedia Tech Blog has a post explaining why many users were unable to reach Wikimedia sites due to DNS resolution failure. The article states, 'Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.'"
  • Wow! (Score:1, Funny)

    by Anonymous Coward

    DNA resolution failure

  • by Anonymous Coward

    Because of this outage, I actually had to work this morning.

  • Human or otherwise?
  • DNA DNS? (Score:2, Funny)

    by Anonymous Coward

    I could see why the failover didn't work... They should try resolving names instead of nucleic acids. :\

  • by Jazz-Masta ( 240659 ) on Wednesday March 24, 2010 @01:29PM (#31601228)

    However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally.

    Good thing Wikimedia pays their System Administrators well enough to test their backup systems.

    • by X0563511 ( 793323 ) on Wednesday March 24, 2010 @01:47PM (#31601536) Homepage Journal

      I know people who work in the Florida DC. They do, and they are smart people. Don't assume incompetence.

      • Re: (Score:3, Informative)

        by Jazz-Masta ( 240659 )

        I actually wasn't assuming incompetence. The hallmark of many SysAdmins is being understaffed, overworked and underpaid, and thus they do not have the resources to properly test all backup and redundant systems.

        As consultants and contractors in the area of System Administration, you get let go if anything like this were ever to happen. This is why they charge a little bit more.

        Whatever happened, it failed. A good lesson for next time. Not knowing exactly the cause, but it is safe to say there were too many eggs

        • Wikimedia is terribly understaffed. They have about 35 employees [wikimediafoundation.org], for one of the 5 largest sites on the Internet (and that includes legal/finance/MediaWiki devs/etc. staff). Basically the site is run by a dozen guys. Compare that to any other Top 10 site, this is just crazy.

          Given their limited resources (both human and financial), it is amazing that Wikipedia is down so rarely. If you want the site to be more reliable, there is something you can do: Donate to the Wikimedia Foundation [wikimediafoundation.org]

        • Wikimedia is a charity. $8M to run a top 5 website is approximately NOTHING. My suggested slogan for the last fundraiser wasn't used: "Give us money. Or the homework GETS IT."

    • Re: (Score:3, Insightful)

      You say test and test again. I say that this is true only when the cost of an outage outweighs the cost of testing. What does this one hour, once per year really cost wikipedia?
      • Re: (Score:2, Insightful)

        Free media publicity.

      • Re: (Score:3, Insightful)

        by Dahamma ( 304068 )

        True, and the cost was probably fairly minor, as they are not advertising based... so only the cost of any people so pissed off with the downtime that they refuse to donate :)

      • by geniice ( 1336589 ) on Wednesday March 24, 2010 @02:23PM (#31602088)

        Going by past statistics the cost of downtime to wikipedia tends to be negative since donations rise. Not that this is something wikimedia aims to do.

        • That may be true. Although, this whole thing has me wondering if they're re-thinking their "Green Technology" push. According to this article [datacenterknowledge.com], they've recently partnered with a European firm that specializes in Green Data Center Technology, most specifically, using "air side economization" cooling techniques (cool the data center with outside air as opposed to mechanical cooling). Now, while I think this is a viable and worthwhile technology and strategy (and before anyone flames me into oblivion for
          • I doubt this was a case of wikimedia deliberately going green. I suspect it's far more likely that they happened to be in the right place and happened to make an offer that wikimedia liked.

    • active/passive systems are a pain in the arse. The whole concept of testing failover in an active/passive situation is wrong. Anything which relies on human beings doing this and that and that and that is a bad solution.

      Just run active/active and load balance over both sites. If one fails its tests, you just pull it.
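
A minimal sketch of the "if one fails its tests, you just pull it" idea from the comment above. It is not how Wikimedia routes traffic; the site names and the check functions are hypothetical stand-ins for real health probes, and the rotation is a plain round-robin chosen only to keep the example short.

```python
from typing import Callable, Dict, List


def healthy_sites(sites: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the names of the sites whose health check currently passes."""
    alive = []
    for name, check in sites.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failed site
        if ok:
            alive.append(name)
    return alive


def pick_site(sites: Dict[str, Callable[[], bool]], request_id: int) -> str:
    """Round-robin over the sites that are still passing their checks."""
    alive = healthy_sites(sites)
    if not alive:
        raise RuntimeError("no healthy site available")
    return alive[request_id % len(alive)]


# Hypothetical example: one site is overheating, the other is fine.
example_sites = {
    "amsterdam": lambda: False,  # assumed failing check
    "florida": lambda: True,     # assumed passing check
}
chosen = pick_site(example_sites, request_id=7)  # only "florida" is eligible
```

Because the checks run on every selection, pulling a site needs no human step, which is the point the comment is making. Anything with shared write state still needs the replication work discussed in the replies.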

       

      • by rmm4pi8 ( 680224 )

        For systems that can be stateless, this is always the best approach. master-master replication with conflict resolution isn't always that easy, however, especially when you think about something like the way wikipedia edits can potentially interact. So developing a conflict resolution scheme can be extraordinarily expensive, and MySQL isn't the most stable in multi-master anyway. Thus while you're right in principle, the expense can be prohibitive.

      • Re: (Score:3, Informative)

        Yes, I agree. But the main issue with that paradigm is that many times the expense of one of your locations (and the quality of that location) is substantially lower than the other.

        Example: I run servers in the US, Brazil and Argentina. The US server has better, cheaper bandwidth than the other two. Also, since these are VoIP servers, sometimes the services I send the calls to are in the US anyway, so even if the call goes originally to Argentina's POP, I'm still forwarding it to some IP in the US anyway.

        So,

      • by xaxa ( 988988 )

        Ping [Amsterdam wikimedia cluster]: 30ms
        Ping [Florida wikimedia cluster]: 130ms

        That's from London. It's obviously better if I normally access the Amsterdam site.

        • powerdns geo backend.

          Which they're already using.... Which means it looks like the problem may be more related to automation of the testing of the sites and the subsequent automatic (vs manual) pulling of a site from the dns when it fails.
           

  • Some government pencil pusher mixed up wikileaks with wikipedia... after all the "strange tweets" from @wikileaks it sounded feasible ;)

  • by Al's Hat ( 1765456 ) on Wednesday March 24, 2010 @01:30PM (#31601242)

    ...as proof of global warming?

  • Hate it when my DNA doesn't resolve.

    Sorry, I know it's just a typo, but it's funny to me.

    • Re: (Score:3, Funny)

      by Locke2005 ( 849178 )
      I don't have a problem with DNA not resolving.
      I have a problem with getting it out of the sheets.
  • rndc flush (Score:3, Funny)

    by ls671 ( 1122017 ) on Wednesday March 24, 2010 @01:33PM (#31601300) Homepage

    I noticed wikipedia wasn't resolving this morning.

    Flushing my "DNA" cache fixed it ;-))

    rndc flush

    • by ls671 ( 1122017 )

      I will add that this is a good thing this article was posted. It caused me to stop investigating the possibilities of somebody hacking into my "DNA". ;-))

    • by Aladrin ( 926209 )

      That is disgusting. :D

    • by Sigma 7 ( 266129 )

      Flushing my "DNA" cache fixed it ;-))

      Not for everyone, since some ISPs cache DNS lookup results.
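
A rough sketch of why a local flush does not help everyone, and why the summary talks about "up to an hour". It assumes a simplified resolver cache where the caller supplies the TTL (in real DNS the TTL comes with the record) and where a failed lookup is cached like any other answer. `TtlCache` and its parameters are made up for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy caching resolver: answers are reused until their TTL runs out."""

    def __init__(self, lookup: Callable[[str], Optional[str]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup      # upstream resolver (hypothetical)
        self._clock = clock        # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[Optional[str], float]] = {}

    def resolve(self, name: str, ttl_seconds: float = 3600.0) -> Optional[str]:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]       # still fresh, even if the answer is stale
        answer = self._lookup(name)
        self._entries[name] = (answer, now + ttl_seconds)
        return answer

    def flush(self) -> None:
        """Drop everything, like flushing a cache you administer yourself."""
        self._entries.clear()
```

If your ISP runs a cache like this, a bad answer stored during the outage keeps being served until it expires there, no matter what you flush locally.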

      • by ls671 ( 1122017 )

        > Not for everyone, since some ISPs cache DNS lookup results.

        It should have been obvious that you needed admin access to your own "DNA" in order for this fix to work... ;-))

        Also your ISP must not intercept your "DNA" queries (redirecting deoxyribonucleic acid #53 to their own DNA)

      • Why, are you forced to use your ISP's DNS servers? here [opendns.com].

        • I prefer level3's DNS servers (4.2.2.1-4.2.2.4). I've heard rumours of them planning to block public access to them, but never heard anything more about it. Works great for me.

          • by ls671 ( 1122017 )

            Here is the list of DNS to query when you run your own DNS, as I stated in my OP. You obviously need to run your own DNS in order to be able to flush the DNS cache as I mentioned in my OP ;-)

            This list of root DNS is guaranteed to remain free for public access. These DNS only return pointers to other DNS and are the foundation of how name resolving works on the internet so you are guaranteed to get the correct data as far as it is possible to get it.

            In short, no third party is required to run your own DNS.

  • Guess it resolved to a chimp?
  • by gad_zuki! ( 70830 ) on Wednesday March 24, 2010 @01:34PM (#31601320)

    Whoa, why is the DNS resolving dATP.dGTP.dCTP.dATP?!?

  • Hour Delay (Score:5, Funny)

    by Reason58 ( 775044 ) on Wednesday March 24, 2010 @01:34PM (#31601322)

    This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.

    If you don't want to wait an hour for it to update, you can open a command prompt and type "ipconfig /flushdna".

    Please be warned that this may also revert you to some sort of single-celled organism.

    • by Dancindan84 ( 1056246 ) on Wednesday March 24, 2010 @01:39PM (#31601402)
      I /flushdna all the time. Hasn't had any noticeable effect except clogging my toilet.
      • Re: (Score:3, Funny)

        I /flushdna all the time. Hasn't had any noticeable effect except clogging my toilet.

        ...?!

        I'd recommend you see a doctor about that.

    • by mrdogi ( 82975 )

      OK, I'm somewhat worried now. I was going to make a snarky comment on how I can't seem to find the ipconfig command on my Mac, but it *actually* has one! Mac is following Windows?!?

      At least I'm still safe with not having one on my Solaris boxen...

    • I AM a single-celled organism, you insensitive clod!

      And: I’m also your single-celled overlord! So bow to me!
      No! Not to wipe me away with your... sponge...! Please no! Aaaaahhhh!
      *wipe*

  • FTFA (Score:5, Funny)

    by Rik Sweeney ( 471717 ) on Wednesday March 24, 2010 @01:36PM (#31601362) Homepage

    We apologize for the inconvenience this has caused.

    [Citation needed]

  • Oops (Score:4, Insightful)

    by girlintraining ( 1395911 ) on Wednesday March 24, 2010 @01:41PM (#31601426)

    You see guys, this is why you regularly test your backup plans and failovers. This is equivalent to building maintenance making sure the fire extinguishers aren't expired... it's basic to IT. Unfortunately, Wikipedia just reminded us that what's basic isn't always what's remembered. Someone just lost their job.

    • Re: (Score:3, Insightful)

      I doubt anyone lost their job over this. What is the real cost of a 1 hour global outage for wikipedia if it only occurs once per year?
      • by u38cg ( 607297 )
        Since donations spike after an outage, they profit from downtime :p
      • by tlhIngan ( 30335 )

        I doubt anyone lost their job over this. What is the real cost of a 1 hour global outage for wikipedia if it only occurs once per year?

        Having to deal with the students who couldn't crib their report off Wikipedia an hour before it was due?

        (Yes, I'm joking. But I suppose we should continue this thread with other fun things we couldn't do with Wikipedia... like make bets about something on Wikipedia - only having edited the article in your favor minutes before).

    • Since wikimedia's server admins have long since been divided into two departments known as wing and prayer they can probably avoid any job losses by blaming each other.

    • You build your systems to be fault tolerant. They automatically continue with half the components missing. Automatically disable those which fail the continually running tests.

      Build your backup tests into daily procedures. i.e. don't copy/scp files to other locations/servers/sites, restore them to the other location. Autorestore DB backups to the staging/test/dev/reporting systems daily.

      Computers are there to do stuff automatically. Getting human beings to do them is prone to failure.

      • You make some very good points in your post.

        At the end of it all comes the realization that planning for crisis is complicated, and getting it right is hard. It's also something that every organization I have ever worked with has underestimated considerably. From what little information I have about this incident with Wikimedia (I noticed nothing, myself), they did considerably better than average.

        But you are right: the right approach is not to prepare for contingency, but to make recovery part of the norma

    • Re: (Score:2, Insightful)

      by VTEX ( 916800 )

      Someone just lost their job.

      I highly doubt someone lost their job over this - and they shouldn't. There are no perfect systems out there, period. Given Wikipedia is a not for profit corporation, they very likely have limited resources and the IT staff does the best with what they have. Even with a virtually unlimited amount of resources things can still go wrong in a "Perfect Storm".

      If anything, the System Administrators should be commended for their quick actions to get the site back up and running as soon as they did.

    • by Yvanhoe ( 564877 )
      Someone does an awesome job at having a failover procedure for such an incredible non-profit project. And for resuming access within one hour. For heaven's sake, they don't even make money keeping the biggest encyclopedia of all History online, give them a break !

      Come on wikipedia, fix this, but rest assured that we all love you !
    • Yea, the problem is people tend to 'regularly test' during the work day in my experience which results in the exact same event happening anyway.

      It generally only happens once, either accident or during testing, and gets fixed. Unless you're going to do ALL your testing during off hours, which is really hard to define for a global operation, then any test that fails is just the same as a failure during non-test conditions.

      Testing for no reason other than testing is not always the brightest of ideas, contrar

  • Nothing to see here. Overheating was normal behavior after I updated the Pr0n article.

  • Edited? (Score:3, Insightful)

    by DarkKnightRadick ( 268025 ) <the_spoon.geo@yahoo.com> on Wednesday March 24, 2010 @02:05PM (#31601800) Homepage Journal

    Well, looks like all the DNA jokes are now -1 off topic

    Well played /., well played.

  • But when I got to the wiktionary.org main page I didn't see any kind of note or warning.

    Couldn't they have at least put up some kind of warning box, hopefully with a list of IP addresses underneath so that one could directly access the services when in dire need?

    .
    .
    .
    .
    .

    (I'm not really sure what constitutes "dire need" of wikimedia services, but I'm sure someone can come up with a list of relevant circumstances)

    • by PPH ( 736903 )

      I'm not really sure what constitutes "dire need" of wikimedia services, but I'm sure someone can come up with a list of relevant circumstances

      You could look up 'Dire Need' on Wiki..... oh, never mind.

  • They couldn't get to the Wiki page about failover testing.
  • From the Summary:
    "Due to an overheating problem in our European data center many of our servers turned off to protect themselves"
    "we were forced to move all user traffic to our Florida cluster"

    I think Wikipedia needs to build some data centers further north.

  • Deleted? (Score:5, Funny)

    by Grishnakh ( 216268 ) on Wednesday March 24, 2010 @02:24PM (#31602100)

    I thought maybe they had simply deleted Wikipedia because some admin decided nothing on there was "notable".

  • by rritterson ( 588983 ) on Wednesday March 24, 2010 @02:49PM (#31602462)

    I see lots of comments stating that this would not have happened had admins run regular tests on the failover mechanisms. That seems a poor assumption- if the system happens to fail and then an outage occurs before the next scheduled test, one may not be aware of it.

    We had this problem recently where we were testing our backup generator. Normally, we cut power to the local on-campus substation, which kicks in the generator and activates a failover mechanism, rerouting power. Well, the generator came on no problem but the failover mechanism was broken, so every server in the datacenter spontaneously lost power. Had we known the failover was broken, we would have not done the regular test. However, the last test on the failover (done directly without cutting power), a mere month prior, had shown the failover mechanism was fine.

    Point being, unless you are going to literally continuously test everything, there is still some probability of an unexpected double failure.
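
To put a rough number on that point, here is a back-of-the-envelope sketch. It assumes the failover mechanism breaks silently at a constant rate (an exponential model, which is my assumption, not something from the incident) and that every test finds and fixes a break. The rates used in the example are invented.

```python
import math


def prob_failed_since_last_test(failure_rate_per_day: float,
                                days_since_test: float) -> float:
    """Chance the mechanism has silently broken since it last passed a test."""
    return 1.0 - math.exp(-failure_rate_per_day * days_since_test)


def average_unavailability(failure_rate_per_day: float,
                           test_interval_days: float) -> float:
    """Time-averaged chance the mechanism is broken when you need it.

    This is the mean of 1 - exp(-rate * t) over one test interval; for small
    rate * interval it is close to rate * interval / 2.
    """
    x = failure_rate_per_day * test_interval_days
    if x == 0:
        return 0.0
    return 1.0 - (1.0 - math.exp(-x)) / x


# Invented numbers: one silent break per 1000 days on average.
monthly = average_unavailability(0.001, 30)   # roughly 1.5%
weekly = average_unavailability(0.001, 7)     # roughly 0.35%
```

Testing more often shrinks the number but never takes it to zero, which matches the generator story above: a test a month earlier had passed.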

    • As you pointed out, testing can (and in my experience with data center failures is usually) be the cause of a failure.

      The only time I've ever had an 'outage' in a data center, it was during a test cycle. While that's great that it was during a test cycle, it STILL resulted in an outage. Had the tests not been performed, no service disruption would have happened.

      Testing software in a test lab ... you test continuously.

      Testing a production environment ... you do it only when you have a real reason to suspec

  • if only their blog had mod points. all the comments are of the form "still down where ever I am"
  • Darn, I thought Wikipedia was going to explain today's global outrage.
  • by RAMMS+EIN ( 578166 ) on Wednesday March 24, 2010 @02:55PM (#31602574) Homepage Journal

    Speaking of Wikipedia, an idea that has long been in my mind, but that I have never sat down and worked out is distributed hosting of Wikipedia. The idea is that volunteers each contribute some resources (network capacity, storage space, RAM, and CPU cycles) to host and serve part of the content.

    This way, we should be able to reduce the load on the (donation supported) Wikimedia servers, as well as increase the redundancy in the system.

    Is anybody already working on this or are there perhaps even already implementations of this idea?

    • Re: (Score:3, Interesting)

      by u38cg ( 607297 )
      Attempts have been made at the general case, but it is a hard problem: how do you ensure fair resource sharing and reliability?
    • by BitZtream ( 692029 ) on Wednesday March 24, 2010 @04:38PM (#31604100)

      It's hard enough keeping a bunch of nodes that you control online and functioning properly (hence the failure) ... trying to run anything reliable when you give any control you had to other random people on the Internet is doomed to fail.

      The only reason distributed computing projects like SETI@HOME and distributed.net work is because the server gives clients data to process but it doesn't need a quick response, nor does it have to trust that the data returned is actually valid ... it's going to have another host check it at some point anyway to be sure. Those clients are used to weight the data so the master server only processes the most likely packets that may match and need authoritative checking.

      Doing that for a web server would be ... well, a complete and total waste of resources as it's likely to be worse in every single way, including reliability.

      • trying to run anything reliable when you give any control you had to other random people on the Internet is doomed to fail.

        I've heard a talk from someone who suggested moving to content-addressing: instead of giving you a URL, I give you a sha1 hash of the page you want (and maybe an URL to tell you where to start looking). Then, you don't care from where you get your data, as long as it matches. You can grab the page from the originating host, or from a local cache, or from a bunch of different peers, or from... well, you name it. As long as you get the bits that match the hash, you're happy.

        I think the idea is (1) good; (2
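
A small sketch of the content-addressing idea described above: ask for a page by its SHA-1 and accept it from any source whose bytes match. The `fetch_by_hash` name and the list-of-callables interface are assumptions for illustration. SHA-1 is used because the comment names it; collisions have since been demonstrated, so a stronger hash would be the obvious swap today.

```python
import hashlib
from typing import Callable, Iterable, Optional


def fetch_by_hash(expected_sha1: str,
                  sources: Iterable[Callable[[], bytes]]) -> Optional[bytes]:
    """Try each source in turn and return the first body matching the hash.

    Sources can be the origin, a local cache or untrusted peers; a source
    that errors out or returns altered bytes is simply skipped.
    """
    for fetch in sources:
        try:
            data = fetch()
        except Exception:
            continue  # unreachable peer: move on to the next one
        if hashlib.sha1(data).hexdigest() == expected_sha1:
            return data
    return None  # nobody had bytes matching the requested hash
```

Note the limit: a hash names one fixed version of a page, so something still has to tell you the hash of the current revision.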

        • That's why a git backend is such a tempting idea. Leading to many abortive attempts to write such a thing.

    • It's really hard keeping the databases distributed. Basically, all WMF wikis are served from three large database clusters in Florida. The parallelisation is having those large DB servers feed lots and lots of Apaches (which run PHP and render the pages into HTML) and worldwide Squids for reverse proxying.

      Wikileaks was mooting plans for a distributed MediaWiki backend - they have serious need for such a thing - but they haven't managed it either.

      There are perennial experimental projects to put something lik

  • I started reading the article and wondered why there was such global outrage about dns resolution on Wikipedia, then I went back and looked at the title again...
  • by porky_pig_jr ( 129948 ) on Wednesday March 24, 2010 @07:16PM (#31605736)

    I was rather pissed. And the only thing I was going to do is to look up a few math terms. Ended up using PlanetMath and a few other sites, but when Wiki came back, I checked them as well and guess what: they had the most comprehensive and informative articles. That's the first outage I remember since I started using Wiki.

    • Re: (Score:2, Troll)

      by owlnation ( 858981 )
      It was as though a great calm came over the Force. Like the dawning of a new age, one based on freedom and facts. One where people were free to write articles without fear of deletion and condemnation. Or edit articles without fear of biased reversion, or banishment.

      Suddenly people saw the wood from the trees, and realized there was a whole Internet out there with truth and beauty in it, where jack-booted book-burners were not only not in control, but not welcome either.

      And then they broug
    • You realise of course that pretty much all of Wikipedia is multiply mirrored ... answers.com, Google cache, Bing ...
