Slashdot.org Self-Slashdotted 388
Slashdot.org was unreachable for about 75 minutes this evening. Here is the post-mortem from SourceForge's chief network engineer Uriah Welcome. "What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked; at the same time I had also noticed things were not happy. After fighting with our external management servers to log in, I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. Looking at the logs on each of the core switches, they were complaining about being out of CPU; the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting off the downlink ports to those switches, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20 Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. 
The network came back up around 10:10 PM EST."
Hork's been forked -- it's "borked"! (Score:3, Informative)
But I thought "horked" meant, y'know, horked, eh? Meaning, like, "stolen" --
Doug: Hey - somebody horked our clothes!
Bob: Geez, who'd want to hork our clothes, eh?
Cheers,
Re:*Sniff* they grow up so fast! (Score:5, Informative)
I'm surprised STP was off by default. I remember in 1999 or so I had some trouble that resulted in my having to turn STP off on Cisco switches (they shipped with it on; these were 3524s and a 5505). I can't actually remember why. I think it had something to do with a Novell server?
The problem was likely that the machine required the network at boot (typical NetWare clients were like that, I've been told). STP started when the link came up, but it took rather a long time to converge, so forwarding had not yet been enabled by the time the client needed the network.
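That timing problem can be sketched roughly like this (a toy model, not switch code; the 15-second forward delay is the classic 802.1D default, and the function names are mine):

```python
# Toy model of classic 802.1D STP port-state timing, to show why a
# client that needs the network immediately at link-up can fail.

FORWARD_DELAY = 15  # seconds spent in each of the listening and learning states

def seconds_until_forwarding(portfast_enabled: bool) -> int:
    """Time from link-up until the port actually forwards frames."""
    if portfast_enabled:
        return 0                # edge port jumps straight to forwarding
    return 2 * FORWARD_DELAY    # listening (15s) + learning (15s)

def client_boot_succeeds(client_timeout: int, portfast_enabled: bool) -> bool:
    """A boot-time client (e.g. an old NetWare login) gives up after
    client_timeout seconds with no working network."""
    return seconds_until_forwarding(portfast_enabled) <= client_timeout
```

So a client that only waits ten seconds fails behind a default STP port but works fine once the edge port skips the listening/learning states — which is presumably why people end up turning STP off entirely instead.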
Since then, I have seen exactly that situation many times in small office environments. Also, the classic case of plugging into the wired network while still being on the wireless side of the network.
Port security helps a lot.
STP is also not fail-safe because typical switches happily forward traffic even if the STP process running on the CPU has died. If you build a L2 core, one broken switch (or OS glitch on a switch) can still take down your entire network easily (it's one of those pesky distributed, multiple single points of failure). In general, L3 networks are somewhat more robust in this regard, so it's often a good idea to avoid switch-to-switch connections (but that might be difficult, as it is difficult to tell L2 devices from L3 devices these days).
Mis-configured trunk ports can cause such an issue (Score:3, Informative)
This usually happens when two switches are attached with two (or more) trunked links ("EtherChannel" in Cisco terminology), and one of the switches has the trunk disabled on one of the ports (or someone moved the cable to another port during a diag). Thus the attachment becomes a loop. STP could take care of this, but it's common to disable it on access switches.
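Why a loop like that saturates links: a switch floods a broadcast out every port except the one it arrived on, and at L2 there is no TTL, so copies never die. A toy model (my own, not any vendor's code) of how many copies of one broadcast are circulating a loop of parallel switch-to-switch links:

```python
# Toy model of broadcast flooding around a switch-to-switch loop with
# STP disabled. Each copy arriving over one link is re-flooded out the
# other (parallel_links - 1) links; nothing ever removes a frame.

def frames_in_flight(parallel_links: int, hops: int) -> int:
    """Copies of a single broadcast still circulating after `hops`
    traversals of a loop made of `parallel_links` redundant links."""
    copies = 1
    for _ in range(hops):
        copies *= (parallel_links - 1)
    return copies
```

With two links the count stays constant, but those copies circulate forever at wire speed (and get re-flooded to every access port on each pass); with three or more links the count grows exponentially. Either way the loop eats the fabric, which matches the "switches still sitting there blasting 20 Gbit/sec" in the story.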
Re:Sometimes You Have To Be There (Score:5, Informative)
Depends how good your out-of-band management is.
Re:In Soviet Russia (Score:5, Informative)
There are no buffalo living in the US. Only bison. ;-)
Re:Hork's been forked -- it's "borked"! (Score:1, Informative)
You're thinking of the Canadian definition of the word.
Urban Dictionary [urbandictionary.com]
Re:Thanks for the information (Score:5, Informative)
I don't know about you, but I'm suing for compensatory damages. Do you have any idea how much pain and suffering the work I did in that time caused me?!
Fixed that for you. Sorry, law student.
Re:Would like final analysis (Score:5, Informative)
I'll be sure to when I get to the data center next week and am able to get my hands on the angry switch in question. I do love how it just sat there quietly for two weeks w/o doing anything and then decided randomly to just start blasting out 20 Gbit.. sigh.. hardware..
Re:Would like final analysis (Score:5, Informative)
Failed ASIC on the switch most likely.
I've seen an issue just like that about once a year, but working with a sick number of systems globally, the chances of seeing one-offs become fairly regular.
Depending on the failure it might have logged what it was doing, but I'll presume that, since your monitoring didn't catch the spike, it was just random garbage.
Fun times!
Re:*Sniff* they grow up so fast! (Score:3, Informative)
They're not stupid -- in fact, most of the clients I work with do things daily that I could never accomplish -- but they occasionally do stupid things with computers and networks.
I usually prefer "ignorant", which implies that you just don't (yet) know any better. I reserve "stupid" for a special class of mistakes, like expecting servers to work while unplugged.
Put another way, stupid mistakes make you slap your forehead. Ignorant mistakes make you think, "oh, that's interesting!"
Re:Wow, that sucks (Score:5, Informative)
The point is, they hadn't already given him direct access to those connections before yesterday, and he had to spend a large chunk of those 75 minutes getting the authorization to access the equipment so he COULD fix it.
That's not how I read it at all. The switches were so overloaded that he had to "fight" to get into the box. He, more than likely, already had access to the box, but the network was working against him.
Re:Sometimes You Have To Be There (Score:3, Informative)
Even with the best out-of-band management, if your switch doesn't respond or doesn't accept commands because it's out of CPU, there is not much you can do. Also, just because a port is down doesn't always mean the CPU will/can ignore it. Sometimes there is no alternative but to pull out the cable.
Re:Do you get the pink screen? (Score:5, Informative)
reboot core = reboot a core switch.
Re:Wow, that sucks (Score:5, Informative)
For Slashdot staff, I think the generally accepted nominal is "It"...
Re:*Sniff* they grow up so fast! (Score:3, Informative)
What bothers me just as much is when I see a ton of switches in an environment with their VTP mode set to Server. A small mixup with VTP configuration revision numbers and you've replaced your entire VLAN database with... an empty one! It's an easy problem to fix, but nobody likes losing their entire network, even for just a few minutes.
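The footgun is that a VTP server or client accepts whatever VLAN database arrives with a higher configuration revision number, regardless of what's in it. A rough sketch of that behavior (names and structure are mine, not Cisco's implementation):

```python
# Toy model of the VTP revision-number behavior: higher revision wins,
# even if the advertised VLAN database is empty.

def receive_vtp_advert(local_rev, local_vlans, advert_rev, advert_vlans,
                       mode="server"):
    """Return the (revision, vlans) this switch keeps after an advert."""
    if mode == "transparent":
        return local_rev, local_vlans      # transparent mode only relays adverts
    if advert_rev > local_rev:
        return advert_rev, advert_vlans    # higher revision always wins
    return local_rev, local_vlans
```

So a lab switch plugged into production with a higher revision and an empty database wipes every VLAN on the domain — e.g. `receive_vtp_advert(42, {"users", "servers", "voice"}, 100, set())` keeps `(100, set())`. Transparent mode sidesteps the whole problem.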