Slashdot.org Self-Slashdotted

Slashdot.org was unreachable for about 75 minutes this evening. Here is the post-mortem from Sourceforge's chief network engineer Uriah Welcome. "What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to log in I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20 Gbit/sec of traffic out trying to do something -- I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline.
The network came back up around 10:10 PM EST."

  • by BadAnalogyGuy ( 945258 ) <BadAnalogyGuy@gmail.com> on Monday February 09, 2009 @11:10PM (#26793423)

    So if you hammer your own servers, do you have to send an email to krow to get your privileges restored?

  • Wow, that sucks (Score:3, Interesting)

    by drachenstern ( 160456 ) <drachenstern@gmail.com> on Monday February 09, 2009 @11:11PM (#26793427) Journal

    So why didn't y'all have access from the home office?

  • by sleeponthemic ( 1253494 ) on Monday February 09, 2009 @11:11PM (#26793429) Homepage
    Now if you could just post the link to the form where I can claim my full refund (for time not wasted incurred) I'll go back to being a loyal "customer".
  • by MindlessAutomata ( 1282944 ) on Monday February 09, 2009 @11:11PM (#26793431)

    In Soviet Russia, Slashdot slashdots Slashdot!

  • A.I. (Score:5, Funny)

    by gmuslera ( 3436 ) on Monday February 09, 2009 @11:12PM (#26793451) Homepage Journal
    Probably the biggest proof that Slashdot has become sentient is that it is willing to kill itself rather than see another batch of Idle videos.
  • by exley ( 221867 ) on Monday February 09, 2009 @11:12PM (#26793455) Homepage

    Slashdot has apparently learned how to masturbate, because it is now fucking with itself!

  • The HAMSTERS?
    http://www.webhamster.com/ [webhamster.com]

  • by Toe, The ( 545098 ) on Monday February 09, 2009 @11:13PM (#26793461)
    Any day you get to legitimately use "horked" in a public post can't be all bad. :P
  • by Midnight Thunder ( 17205 ) on Monday February 09, 2009 @11:13PM (#26793467) Homepage Journal

    When you do work out what the root cause was, I am sure we would all like to find out what it was, so please post an update when you can.

  • by Anonymous Coward on Monday February 09, 2009 @11:14PM (#26793469)

    Who Slashdots the Slashdotters?

  • When even Slashdot gets slashdotted. Now if only we can make the Digg effect bury that site. For good.
  • by narcberry ( 1328009 ) on Monday February 09, 2009 @11:18PM (#26793489) Journal

    First thing I'd do as Cyber Security Tzar would be to outlaw any network device that has the potential to become faulty.

    We could've avoided this tragedy entirely.

    • by MBGMorden ( 803437 ) on Tuesday February 10, 2009 @12:16AM (#26793875)

      Indeed. Studies show that you're far more likely to get hacked if you keep a computer in your home. Indeed it's often even a case where an attacker is able to wrest control of your own computer from you and use it against you.

      At the very minimum, given the elevated hazard potential to kids (over 90% of kids will suffer a computer accident before the age of 18), you should always keep your computers and networking equipment securely locked in separate compartments.

      I'm not going to go so far as you and call for an outright ban, but I think it's obvious that we need common-sense computer control laws put into place. In particular, we need to stop the widespread smuggling of these devices from across the borders of places such as Taiwan, Japan, and California, into our outer-city suburbs.

      • Re: (Score:3, Funny)

        by MightyYar ( 622222 )

        Couldn't we legislate the sale of a keyboard lock with every computer? Or maybe a smart computer that only responds to the hand of its registered, legal owner.

  • by qw0ntum ( 831414 ) on Monday February 09, 2009 @11:22PM (#26793525) Journal
    Even though /. was down, I still managed to not get any work done. Maybe it had something to do with the fact I kept rechecking to see if it were back up. Or maybe I should just stop blaming my laziness on external factors and just admit it is a personal problem: I would still find ways to not do work even without Slashdot! :P
  • www.slashdot.org loads just fine but slashdot.org gives a 500 internal server error.

  • by lymond01 ( 314120 ) on Monday February 09, 2009 @11:58PM (#26793769)

    The year is 2025.

    Well, Ladies and Gentlemen, here you see what you may think is an archaic lot of old computers. You would be mistaken. These are Slashdot. No, no cause for alarm...and that door's locked anyway, you can't get out through there. The tour only goes forward. But I'm glad at the very least that you know what Slashdot is. Not was. IS.

    It's a safeguard against...something. Something that was unleashed for 75 minutes in 2009 that crippled what was rumored to be the most robust public-facing cluster known. All we have left from that fateful day is the single post from the Slashdot network admin. Someone archived it, lucky us, because he was never seen after that day. I have a copy here, hardcopy of course -- no sense in taking risks so close to...well....

    Here it is:

    I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something. I just don't know what yet.

    • by jd ( 1658 )

      *cue Holst's Mars* (Hey, we all know CmdrTaco is related to Professor Bernard Quatermass)

    • by JWSmythe ( 446288 ) * <jwsmythe@noSPam.jwsmythe.com> on Tuesday February 10, 2009 @02:11AM (#26794417) Homepage Journal

      Nah, I used to run one of the bigger, well-known publicly facing clusters [alexa.com]. It was ranked #300 by Alexa when I left over 2 years ago. What's happened since is their own fault. :)

          Actually, this wouldn't have downed that network. Every GigE circuit was individual to a city, or set of racks (depending on the site). There were no cross connects between them. Almost everything was designed so if we lost a city for any reason, it didn't hurt the site. We had connectivity outages, and even a couple brownouts that upset the power systems, but the sites were always accessible.

          Slashdot should not, under any circumstances, be hosted in one location. In my opinion, they should be at the largest continental and intercontinental peerings that they can be at.

          1 Wilshire, Los Angeles, CA - providing the west coast of the US, and the most substantial fiber links on the Pacific.

          111 8th Ave, New York, NY - providing the east coast of the US, and virtually all of the links to Europe.

          36 NE 2nd St, Miami, FL - providing the southeast US, redundancy for the Southeast US, and some fiber to Europe and S. America

          Redundant options.

          426 S LaSalle St, Chicago, IL - providing good service to the East and West coast of the US

          55 S Market St, San Jose, CA - providing good service to the West coast of the US, and some trans-Pacific connectivity

          Some people really like Atlanta, Dallas, Houston, Las Vegas, Salt Lake City, and Vienna/Ashburn/Reston. I don't really suggest it, if you can have a presence in the better locations.

          There are some very nice global options too. I'm not sure how well the European networks have cleaned up. Several years ago, due to peering arrangements over there, most European traffic ended up going to New York and back to Europe, even though we were on one of the top Tier 1 providers. We ditched the site, and sent all of Europe to New York. Our users sent compliments on our "new data center in Europe", since it was so fast. :) People like to complain, but rarely send compliments. That was interesting. There are some great locations in Australia and Asia also, but ... well ... it's all in how much you want to spend.

          I know people in the Silicon Valley always scream when I suggest them as secondary, but if you've had a good look at all the major cities, you'd get over yourselves. Just because you live there, and there are expensive neighbors, it doesn't make you the center of the world.

          Slashcode would need some revamping to make work in this environment. There are lots of options there too.

          But, I'm not on the Slashdot IT team, so I don't get to make these decisions (or even give opinions).

      • If it were me, I'd go for both California options. They're both near enough to the San Andreas Fault to be vulnerable to a major quake, but far enough apart that no one temblor would get both of them.
        •     Nah, sometime later this year the big one will split California from Mexico through Oregon, and make the island state previously known as SansAngeles. :)

              Now, when will they get fiber run across the gap is another question. :)

           

  • by GaryOlson ( 737642 ) <slashdot @ g a r y o l s o n .org> on Tuesday February 10, 2009 @12:08AM (#26793821) Journal

    ...the problem down to a pair of switches...I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something -- I just don't know what yet.

    Is it possible the duplicate article generator tried to spawn, became entangled in its own potential well of duplicity, and now is trapped like two Lisp programmers deep inside their parentheses?

  • In Korea, only old people slashdot slashdot. The memes are funny. The insightful comments are insightful. The funny comments are funny, the trolls are trolls. Seems resetting slashdot fixed everything. The entire world is doomed!
  • by chrome ( 3506 ) <`ten.suodneputs' `ta' `emorhc'> on Tuesday February 10, 2009 @12:20AM (#26793893) Homepage Journal

    The worst thing about this? 5,000,000 people who think they know what happened, posting "helpful" suggestions or analysis

    "The problem is definitely spanning tree!"

    or

    "Back in 1998, we were running these HP switches right, and ..."

    or

    "Did you try resetting the flanglewidget interface?!"

    or

    "I've seen this exact problem! You need to upgrade to v5.1!"

    etc

    It's not your network. It doesn't matter how much you think you know, you don't know the topology, or the systems involved. It'll be interesting to know what the ACTUAL reason was, when they figure it out. Assuming it isn't aliens.

    • by XanC ( 644172 ) on Tuesday February 10, 2009 @12:42AM (#26794007)

      ...Because if it's aliens, then it won't be interesting?

      • by jd ( 1658 )

        Not really. Aliens log onto Slashdot a lot. The Timelords are the worst offenders, using the Matrix and a space/time inversion multiplexor to access the unused ports on the Slashdot switches directly.

        • Re: (Score:3, Funny)

          by Darth ( 29071 )

          this actually explains duplicate posts pretty well...
          The time lords, for a joke, take stories from slashdot, go back a day or two, and submit them. They get posted a few days early, but to avoid paradox, reality requires the "original" post to be made anyway. Thus we get double posts of stories.

          You all owe the slashdot editors an apology.

    • by jd ( 1658 ) <`imipak' `at' `yahoo.com'> on Tuesday February 10, 2009 @12:58AM (#26794087) Homepage Journal

      It's likely multicast-related, as that's where TFA states the problem was seen. There are only so many multicast issues you can have. True, we don't know the topology. True, we don't know the switch configuration. True, it's just as possible this is some sort of revenge by the Church of Scientology for all the Slashdot articles on them.

      However, some things seem more plausible than others. Since this was a spontaneous problem, hardware seems more suspect than software. If it is software (unlikely but possible), the only multicast protocols most switches use are the spanning-tree protocols.

  • Slashdotted (Score:5, Funny)

    by Greyfox ( 87712 ) on Tuesday February 10, 2009 @12:37AM (#26793983) Homepage Journal
    Mirror [slashdot.org]
  • by jamesh ( 87723 ) on Tuesday February 10, 2009 @12:44AM (#26794023)

    I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something -- I just don't know what yet

    We had something similar happen at a client site - a switch failed in a rack so we temporarily replaced it with an 8 port 'desktop' switch, and then a day later installed the proper replacement back in the rack. We didn't want any unnecessary downtime though so we linked them together and left instructions with the onsite guy to move all the connections from the desktop switch into the proper switch after hours. Which he did, including the cable that linked them together. The switch was in 'portfast' mode so any broadcast packet that got 'onto' the switch, stayed there :)

    • Re: (Score:3, Funny)

      by powerlord ( 28156 )

      The switch was in 'portfast' mode so any broadcast packet that got 'onto' the switch, stayed there :)

      First rule of portfast mode:

      Whatever happens in portfast mode, stays in portfast mode.

  • by His Nastiness ( 542696 ) on Tuesday February 10, 2009 @01:01AM (#26794103) Homepage
    February 9th, 2009 8:55pm Slashdot becomes self-aware.
  • ...were he not typing that long-a$$ summary. Twice as fast if he didn't have to spellcheck.

    (j/k)

    Which leads me to this question:
    What do Slashdotter staff read to avoid doing work?

  • Is this happening more often than it used to? I mean, it's tech and this is a non-paying site for most of us... it's going to break. But I swear, I remember we used to go over a year w/o seeing /. downtime, now it seems like it happens every few months.

    Or have I just become more of a /. junkie than I used to be?
  • by wtarreau ( 324106 ) on Tuesday February 10, 2009 @01:55AM (#26794347) Homepage

    This thing usually happens when two switches are attached with 2 (or more) trunked links ("etherchannel" in cisco terminology), and one of the switches has the trunk disabled on one of the ports (or someone moved the cable to another port during a diag). Thus the attachment becomes a loop. STP could take care of this, but it's common to disable it on access switches.
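To illustrate why a loop like that is fatal, here is a rough sketch, not a model of the actual Slashdot network. It assumes switches that simply flood a broadcast frame out every link except the one it arrived on, with no STP and no MAC learning; the `flood` function and the A/B switch names are made up for the example. Ethernet frames carry no TTL, so nothing ever retires the copies.

```python
def flood(links, start, hops):
    """Count the copies of one broadcast frame in flight after each hop.

    links: list of (switch_a, switch_b) pairs; parallel links are allowed.
    Each switch floods a frame out every link except the one it came in on.
    """
    in_flight = [(start, None)]  # (switch holding the frame, link it arrived on)
    counts = []
    for _ in range(hops):
        next_hop = []
        for switch, arrived_on in in_flight:
            for link_id, (a, b) in enumerate(links):
                # Skip the ingress link and links not attached to this switch.
                if link_id == arrived_on or switch not in (a, b):
                    continue
                far_end = b if switch == a else a
                next_hop.append((far_end, link_id))
        in_flight = next_hop
        counts.append(len(in_flight))
    return counts


if __name__ == "__main__":
    # Two links between A and B that are NOT bundled as a trunk: a loop.
    print(flood([("A", "B"), ("A", "B")], "A", 5))
    # One of the two links blocked, which is roughly what STP would do.
    print(flood([("A", "B")], "A", 5))
```

With the loop in place the copies never die out; with one link blocked the frame is delivered once and disappears. On real hardware every new broadcast adds to the pile, which is how a handful of frames ends up saturating the links.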

  • Seen That Once (Score:5, Interesting)

    by maz2331 ( 1104901 ) on Tuesday February 10, 2009 @02:40AM (#26794537)

    A couple years ago, I had to troubleshoot a problem that was similar for a school district's network. Absolutely nothing could communicate.

    I checked switches, routers, and servers for a while until I hooked a sniffer up, and still got baffling results.

    THEN I decided to go low-tech, and start disconnecting cables. That got me somewhere - certain backbone connections could be disconnected and traffic levels dropped to normal levels.

    So, I hooked them back up, and went to the other end of the link, and started disconnecting things port by port until I found the problem.

    It turned out to be an unauthorized little 4-port switch that had malfunctioned, and was spewing perfectly valid (as in, good CRC) packets to the LAN, but with random source MAC addresses.

    THAT took down every switch in the network, as it required them to update their internal tables on a per-packet basis. The thing was actually not sending much data, but it was poisoning the switches' internal tables. Not at the IP layer, but at the MAC layer.

    When networking gear goes rogue, it can do really bad things to other connected equipment.

    It's really hard to find the problem because every indication from every other piece of equipment is confusing. You almost always have to go to the backbone and disconnect entire segments to find it.
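A toy model of that table poisoning, for anyone who wants to see the effect in numbers. Everything here is an assumption for illustration: the `LearningSwitch` class, the 8-entry capacity, the oldest-first eviction, and the host addresses. Real switches use much larger tables and aging timers, but the per-frame churn is the same idea.

```python
import random


class LearningSwitch:
    """Toy MAC-learning table (capacity and eviction policy are assumptions)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.table = {}   # source MAC -> port, kept in insertion order
        self.updates = 0  # how many times the table had to be rewritten

    def receive(self, src_mac, port):
        if self.table.get(src_mac) == port:
            return  # already known on this port, no work to do
        if src_mac not in self.table and len(self.table) >= self.capacity:
            # Table full: drop the oldest entry to make room.
            del self.table[next(iter(self.table))]
        self.table[src_mac] = port
        self.updates += 1


# Three well-behaved hosts, each on its own port.
HOSTS = [("aa:00:00:00:00:01", 1), ("aa:00:00:00:00:02", 2), ("aa:00:00:00:00:03", 3)]


def send_legit(switch, frames):
    for i in range(frames):
        mac, port = HOSTS[i % len(HOSTS)]
        switch.receive(mac, port)


def send_rogue(switch, frames, port=4, seed=0):
    # The broken device: every frame carries a fresh random source MAC.
    rng = random.Random(seed)
    for _ in range(frames):
        switch.receive("%012x" % rng.getrandbits(48), port)


if __name__ == "__main__":
    sw = LearningSwitch()
    send_legit(sw, 1000)
    print("updates after normal traffic:", sw.updates)
    send_rogue(sw, 1000)
    print("updates after rogue traffic:", sw.updates)
    print("legit hosts still known:", [m for m, _ in HOSTS if m in sw.table])
```

Normal traffic settles after one table write per host, while the rogue device forces a write on nearly every frame and pushes the legitimate hosts out of the table, even though it is not sending much data.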

  • Dogbert (Score:4, Funny)

    by ciderVisor ( 1318765 ) on Tuesday February 10, 2009 @05:14AM (#26795189)

    ...being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down...

    What did I say that sounded like "Tell me about your day at work" ?
