
Slashdot.org Self-Slashdotted

Slashdot.org was unreachable for about 75 minutes this evening. Here is the post-mortem from Sourceforge's chief network engineer Uriah Welcome. "What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to login I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. The network came back up around 10:10 PM EST."
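
The "process of elimination" described above boils down to sampling per-port output counters twice and ranking ports by rate. Here is a minimal sketch of that idea, not taken from the post-mortem: read_out_octets, the switch names, and the port names are all placeholders, and in practice the counter would come from an SNMP poll of ifOutOctets/ifHCOutOctets or a CLI scrape.

```python
import time

def read_out_octets(switch, port):
    """Placeholder: return the output-octet counter for one switch port.

    The real version would be an SNMP GET of ifOutOctets/ifHCOutOctets or a
    scrape of 'show interface' output, depending on the gear involved.
    """
    raise NotImplementedError("wire this up to your own switches")

def top_talkers(ports, interval=10):
    """Sample every port's counter twice and rank ports by output bit rate."""
    first = {p: read_out_octets(*p) for p in ports}
    time.sleep(interval)
    second = {p: read_out_octets(*p) for p in ports}
    rates = {p: (second[p] - first[p]) * 8 / interval for p in ports}  # bits/s
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical inventory of core downlink ports.
    # (Will raise until read_out_octets is implemented for real hardware.)
    ports = [("core1", "1/1"), ("core1", "1/2"), ("core2", "1/1"), ("core2", "1/2")]
    for (switch, port), bps in top_talkers(ports):
        print(f"{switch} port {port}: {bps / 1e9:.2f} Gbit/s")
```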

Comments Filter:
  • Things are bad... (Score:2, Insightful)

    by spartacus_prime ( 861925 ) on Tuesday February 10, 2009 @12:14AM (#26793471) Homepage
    When even Slashdot gets slashdotted. Now if only we could make the Digg effect bury that site. For good.
  • by qw0ntum ( 831414 ) on Tuesday February 10, 2009 @12:22AM (#26793525) Journal
    Even though /. was down, I still managed to not get any work done. Maybe it had something to do with the fact that I kept rechecking to see if it was back up. Or maybe I should just stop blaming my laziness on external factors and admit it is a personal problem: I would still find ways to not do work even without Slashdot! :P
  • Re:Wow, that sucks (Score:4, Insightful)

    by Arthur Grumbine ( 1086397 ) * on Tuesday February 10, 2009 @12:41AM (#26793653) Journal
    And "access from the home office" would allow them to do what exactly?!?
  • by adolf ( 21054 ) <flodadolf@gmail.com> on Tuesday February 10, 2009 @12:42AM (#26793663) Journal

    Naw. Stuff sometimes, yaknow, happens. People sometimes make mistakes, and hardware sometimes just breaks. It's not always ignorance -- especially, I'd guess, at the level of Slashdot's back end.

    I once implemented a VoIP phone system at a factory in an evening. (This, in itself, was an undertaking - close to 200 extensions, up and running, between Wednesday at close of business and Thursday when folks started showing up, including three hours on the phone with Sprint to get the PRI and T1 circuits reconfigured at 2:00AM.)

    We left, tired and groggy, with an IP phone placed in a common area for the facilities network admins to train any staff who needed training, at about 7:30AM. At 8:30, after I finally got home and managed to close my eyes, my phone rang. It was the network admin. He had a few minor issues which could've waited, but the real problem was that their network was totally fucked: Packets everywhere. No capacity to do anything. An amazing cascading failure of the sort that one hopes to never see.

    And it wasn't any hodge-podge network, either. HP Procurve switches configured in a redundant fabric mode with gigabit fiber links - hot stuff for the time, especially for a factory. The wiring was all new, and was all good. The network had been designed specifically to avoid the limitations of Ethernet, and was successful to that end (a non-trivial task in an existing building complex). But it was tripping all over itself.

    Turns out that someone had taken that fancy IP phone in the common area with its built-in unmanaged switch, and plugged both of its 10/100 Ethernet jacks into the wall. (Nobody knows who.)

    The ensuing packet storm broke everything. Unplugging one of them fixed the problem pretty much immediately.

    I wrote about this here once before, and everyone's immediate reply was this: "Well, duh. They should've turned the Spanning Tree Protocol on, and this wouldn't have happened. They're obviously idiots."

    But the truth is so much more simple: People make mistakes. It was a mistake to keep STP turned off in that environment, and it was a mistake to plug two fancy ports of a Procurve switch into two dumb ports on an IP phone. Had either of those mistakes not happened, things would've been fine.

    But mistakes happen anyway. We do our best, as IT professionals, to minimize these mistakes, or at least keep them away from production. But sometimes, despite having the best people and the best tools and all the knowledge it takes to make stuff work, shit just happens.
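
The "packet storm" in the parent's story is ordinary flooding behaviour feeding back on itself: a switch resends a broadcast frame out of every port except the one it arrived on, so a single physical loop keeps one frame circulating and multiplying until something gives out. Here is a toy simulation of that flooding rule; the two-switch topology and port numbers are made up purely to stand in for the access switch and the phone's built-in switch, and with STP running one of the two loop ports would simply have been blocked.

```python
from collections import deque

def flood(ports_by_switch, cables, first_arrival, max_hops=50):
    """Count transmissions when one broadcast frame is flooded through switches.

    A frame arriving on a port is re-sent out every other port of that switch;
    `cables` maps each port to the far end of its cable (both directions).
    Without a loop this terminates on its own; with a loop it only stops
    because we cap the hop count.
    """
    transmissions = 0
    queue = deque([(first_arrival, 0)])
    while queue:
        in_port, hops = queue.popleft()
        if hops >= max_hops:
            continue
        switch = in_port[0]
        for out_port in ports_by_switch[switch] - {in_port}:
            transmissions += 1
            if out_port in cables:                    # is anything plugged in?
                queue.append((cables[out_port], hops + 1))
    return transmissions

def both_ways(*pairs):
    """Build a bidirectional cable map from (port, port) pairs."""
    cables = {}
    for a, b in pairs:
        cables[a], cables[b] = b, a
    return cables

# "access" is the wall-jack switch, "phone" is the IP phone's built-in switch.
ports = {"access": {("access", 1), ("access", 2), ("access", 3)},
         "phone":  {("phone", 1), ("phone", 2)}}

one_cord  = both_ways((("access", 1), ("phone", 1)))
two_cords = both_ways((("access", 1), ("phone", 1)),
                      (("access", 2), ("phone", 2)))  # second cord = the loop

# One broadcast frame enters the access switch on port 3 (say, from a PC).
print("no loop:", flood(ports, one_cord,  ("access", 3)))   # finishes quickly
print("loop:   ", flood(ports, two_cords, ("access", 3)))   # grows with max_hops
```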

  • Layer 2 Loop (Score:1, Insightful)

    by Anonymous Coward on Tuesday February 10, 2009 @01:19AM (#26793889)

    Looks like an L2 loop somewhere, and the consequent broadcast (which may include multicast) storm hitting the /. datacenter. Check for ports with spanning tree disabled, and a misplaced cable.
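
If you suspect that kind of storm, one crude check (separate from anything the /. admins actually did) is to plug a box into the segment and measure how much of the raw traffic is broadcast or multicast. A Linux-only sketch using a raw packet socket; it needs root, and the interface name is whatever your monitoring host calls its NIC.

```python
import socket
import time

ETH_P_ALL = 0x0003          # capture all ethertypes
BROADCAST = b"\xff" * 6     # Ethernet broadcast destination address

def frame_rates(interface, seconds=5):
    """Count broadcast, multicast, and unicast frames per second on one NIC."""
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    s.bind((interface, 0))
    s.settimeout(0.5)
    counts = {"broadcast": 0, "multicast": 0, "unicast": 0}
    deadline = time.time() + seconds
    while time.time() < deadline:
        try:
            frame = s.recv(65535)
        except socket.timeout:
            continue
        dst = frame[:6]                  # destination MAC of the raw frame
        if dst == BROADCAST:
            counts["broadcast"] += 1
        elif dst[0] & 0x01:              # group bit set => multicast
            counts["multicast"] += 1
        else:
            counts["unicast"] += 1
    s.close()
    return {kind: n / seconds for kind, n in counts.items()}

if __name__ == "__main__":
    # Thousands of broadcast/multicast frames per second on an otherwise
    # quiet segment is a red flag for a loop-driven storm.
    print(frame_rates("eth0"))
```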

  • by chrome ( 3506 ) <chrome AT stupendous DOT net> on Tuesday February 10, 2009 @01:20AM (#26793893) Homepage Journal

    The worst thing about this? 5,000,000 people who think they know what happened, posting "helpful" suggestions or analysis:

    "The problem is definitely spanning tree!"

    or

    "Back in 1998, we were running these HP switches right, and ..."

    or

    "Did you try resetting the flanglewidget interface?!"

    or

    "I've seen this exact problem! You need to upgrade to v5.1!"

    etc

    It's not your network. It doesn't matter how much you think you know; you don't know the topology or the systems involved. It'll be interesting to know what the ACTUAL reason was, when they figure it out. Assuming it isn't aliens.

  • Re:Spanning Tree (Score:2, Insightful)

    by blosphere ( 614452 ) on Tuesday February 10, 2009 @01:58AM (#26794089) Homepage
    Have you considered using portfast on edge ports? :P You know, it's been there for a while...
  • by Anonymous Coward on Tuesday February 10, 2009 @04:11AM (#26794651)

    Not sure about HP or others, but on Cisco there is an option called bpdu-guard (other managed switches should have a similar option as part of STP).

    Make sure this is enabled on ALL ports that are not connecting any other "managed" switches (under your control, of course).

    If a BPDU shows up on such a port, this will cause the port to go into an "error-disabled" state.

    So when some idiot decides to loop a single cable to two ports on a wall plate, it shuts them both down within a second (and labels them as such in the interface status for each port); without this option it will loop traffic to infinity.

    Found this out the hard way during my first year as network admin: someone saw a cable in a bundle under a table and decided it must be connected to something, except it was already connected to an unmanaged switch, which looped itself. (I have made this option standard on all my switch configs ever since.)

    Posting as AC as I have misplaced my login at the moment.
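
To make the parent's advice concrete: on Cisco gear the usual per-interface pair is portfast plus BPDU guard, and the tedious part is applying it to every edge port. Below is a throwaway sketch that renders that stanza for a made-up list of interfaces; the names and port count are invented, and other vendors spell the feature differently, so check your platform's docs.

```python
# Made-up edge-port inventory; substitute whatever your access switch exposes.
EDGE_PORTS = [f"GigabitEthernet1/0/{n}" for n in range(1, 25)]

def edge_port_stanza(interface):
    """Render the usual edge-port protection block for one interface.

    portfast skips the listening/learning delay on host-facing ports, and
    BPDU guard err-disables the port the moment a switch (or a looped cable)
    shows up behind it, which is the behaviour the parent describes.
    """
    return "\n".join([
        f"interface {interface}",
        " spanning-tree portfast",
        " spanning-tree bpduguard enable",
        "!",
    ])

if __name__ == "__main__":
    for port in EDGE_PORTS:
        print(edge_port_stanza(port))
```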

  • by totally bogus dude ( 1040246 ) on Tuesday February 10, 2009 @04:27AM (#26794699)

    I'm somewhat wondering how you manage to set up a fully redundant switched network without using spanning tree at all. I suppose they might've enabled it just for the switch interconnects and left it off for the access ports so they'd come up faster. Still, if that was the case, they should've been aware of the risks and symptoms thereof.

  • by jibjibjib ( 889679 ) on Tuesday February 10, 2009 @05:00AM (#26794827) Journal
    It sounds more like a network configuration accident or glitch than an attack. Besides, netsplits aren't incredibly unusual.
  • by thePowerOfGrayskull ( 905905 ) <marc...paradise@@@gmail...com> on Tuesday February 10, 2009 @10:06AM (#26796695) Homepage Journal

    Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

    What ever happened to "Duck duck duck goose"?

  • by dkf ( 304284 ) <donal.k.fellows@manchester.ac.uk> on Tuesday February 10, 2009 @11:10AM (#26797489) Homepage

    Depends how good your out-of-band management is.

    And whether anyone's been "smart" enough to decide to run the out-of-band management access over the same network as the production networking "to save resources"...

  • by Guiness17 ( 606444 ) on Tuesday February 10, 2009 @12:45PM (#26798815)
    Indeed. Back in the day [/old gravelly voice], when I was with Bell Northern Research, it was primarily mainframes and Sun Sparcs on the network.

    PCs were just starting to be commonly connected. People were writing their own network stacks. Inevitably, someone would write a bad one, install it on a couple of machines, and a broadcast storm would result.

    Which meant someone from our group would go over with a pair of sidecutters...
