Hospital Brought Down by Networking Glitch - Slashdot

Want to read Slashdot from your mobile device? Point it at m.slashdot.org and keep reading!

×

Hospital Brought Down by Networking Glitch 575

Posted by michael on Wednesday November 27, 2002 @10:41AM from the risks-digest dept.

hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled with moving information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long term solution proposed apparently is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"

This discussion has been archived. No new comments can be posted.

Hospital Brought Down by Networking Glitch

Search 575 Comments Log In/Create an Account

Comments Filter:

Major American Bank Outage (Score:5, Informative)

by MS_leases_my_soul ( 562160 ) writes: on Wednesday November 27, 2002 @10:47AM (#4767020)

A Bank in America [;)] had an outage back in 1998 where all their Stratocom went down for similar reasons. The Gateway/Network Engineering group had been saying for a couple years that we needed more redundancy but senior executives just saw the expenses and not the liability ... until every single Stratacom went down.

We had to rebuild the entire network ... it took a week. All non-critical traffic had to be cut-off as we pushed everything through the backup T1s and ISDN lines. It cost the bank MILLIONS of dollars.

Suddenly, that backup network was real cheap. They are now quite proud to tote their redundancy.

Share
twitter facebook
Hospital Systems (Score:4, Informative)

by charnov ( 183495 ) writes: on Wednesday November 27, 2002 @10:49AM (#4767033) Homepage Journal

I also used to work at a teaching hospital (Wishard for Indiana University) and I learned more there about networking and systems support than in years of college. I remember one day we found a still used piece of thick-net (you know...old firehose). It was connecting the ambulance office's systems to the rest of the hostpital. The rest of the hospital ran on DEC VAX clusters and terminals. To be fair, they have gotten much better (I don't work there anymore either), but this wasn't the first hospital network I had seen that truly terrified me, and it hasn't been the last.

Share
twitter facebook
more info, less sensationalism (Score:5, Informative)

by bugpit ( 588944 ) writes: on Wednesday November 27, 2002 @10:53AM (#4767076)

The Boston Globe article was a tad on the sensational side, and did a pretty poor job of describing the technical nature of the problem. This article [nwfusion.com] does a somewhat better job, but is still pretty slim on the details. Almost makes it sound like someone running a grid application was the cause of the trouble.

Share
twitter facebook
What is spanning tree protocol? (google whoring) (Score:5, Informative)

by Anonymous Coward writes: on Wednesday November 27, 2002 @10:53AM (#4767088)

Spanning-Tree Protocol is a link management protocol that provides path redundancy while preventing undesirable loops in the network. For an Ethernet network to function properly, only one active path can exist between two stations.

Multiple active paths between stations cause loops in the network. If a loop exists in the network topology, the potential exists for duplication of messages. When loops occur, some switches see stations appear on both sides of the switch. This condition confuses the forwarding algorithm and allows duplicate frames to be forwarded.

To provide path redundancy, Spanning-Tree Protocol defines a tree that spans all switches in an extended network. Spanning-Tree Protocol forces certain redundant data paths into a standby (blocked) state. If one network segment in the Spanning-Tree Protocol becomes unreachable, or if Spanning-Tree Protocol costs change, the spanning-tree algorithm reconfigures the spanning-tree topology and reestablishes the link by activating the standby path.

Spanning-Tree Protocol operation is transparent to end stations, which are unaware whether they are connected to a single LAN segment or a switched LAN of multiple segments.

see this page [cisco.com] for mode info

Share
twitter facebook
All Layer 2? (Score:5, Informative)

by CatHerder ( 20831 ) writes: on Wednesday November 27, 2002 @10:55AM (#4767104)

If Spanning Tree is what brought them down, and it had campus wide effect, then they're running their production networks as one big flat layer 2 network. This is almost definitely the root of the problem. Modern network design would divide the campus (and often individual buildings) into multiple subnets, using routing to get between nets. That way if something like STP goes wrong in one spot, it doesn't affect the others.

Building a parallel identical net is almost definitely the wrong answer. Especially if it uses the same design and equipment!

Unfortunately, often older networks grow in a piecemeal way and end up like this, commonly having application level stuff that requires it to be flat. The job of a good network engineer (and diplomat) is to slowly have all the apps converted to being routable and then subnet the net.

Share
twitter facebook
Re:Reliability is inverse to the number of compone (Score:2, Informative)

by pknoll ( 215959 ) writes: on Wednesday November 27, 2002 @10:59AM (#4767139)

Sure, but that's not the point of redunancy. The question you want to ask is: How likely is it that both redundant components will fail at the same time?.
That's how mirrored RAID arrays work: you increase your chances of a disk failure by adding more disks to the system due to probability; but your chances of recovering the data in the event of a crash go up, since more than one disk failing at once is unlikely.

Parent Share
twitter facebook
Re:No. (Score:5, Informative)

by Anonymous Coward writes: on Wednesday November 27, 2002 @11:01AM (#4767153)

As an employee at BIDMC (the Beth Israel Deaconess Medical Center) I can tell you that they did not just install a parallel network. The first network was completely redesigned to be more stable and once it proved its stability, then a second redundant network was put in place to ensure that if the network ever became unstable again for any reason there was a backup that was known to work immediately instead of having to wait to fix the original again. Most of the housestaff at BIDMC were already familiar with the paper system as the transition to paperless had only occured over the last two years and in stages. The real problems was obtaining lab and test results as these have been on computer for years.

Parent Share
twitter facebook
My best hospital glitch (Score:5, Informative)

by eaddict ( 148006 ) writes: on Wednesday November 27, 2002 @11:01AM (#4767154)

was a human error. We were a smallish hospital (270 beds). I was the new IS Manager. I was looking for power outlets in the computer room for all the new euqipment I had ordered. Well, there were a lot of dead plugs. Also, I was told to stop since electricity based things like that were left up to the union guys. No big deal. I called them and asked them to locate and label the outlets under the raised floor. While I was sitting at my desk later that day the power went off for a sec then on.... I got up and looked toward the data center. The lights AND the equipment went off then on. I ran in to find the union guys flipping switches on the UPS (on/off). They had stuck a light bulb w/plug in each of the open outlets and were flicking the power on and off to see what bulb was effected. They were on the equipment side of the UPS! All of our servers, network gear, and such took hard downs that day! Ahhh!!! Who needs technology to make things not work! This was the same union that wrote me up for moving a cube wall to get at an outlet. Moving furniture was a union duty!

Share
twitter facebook
Re:Major American Bank Outage (Score:3, Informative)

by passion ( 84900 ) writes: on Wednesday November 27, 2002 @11:05AM (#4767194)

If triple-redundancy is good enough for San Francisco's BART [transdyn.com], and this "major bank", then why can't it be good enough for a hospital, where there are most likely many people on life support, or who need instant access to drug reactions, etc?

Parent Share
twitter facebook
This assumes.. (Score:5, Informative)

by nurb432 ( 527695 ) writes: on Wednesday November 27, 2002 @11:07AM (#4767221) Homepage Journal

That it was a network upgrade, sometimes its not, and you have no clue what was changed, by *someone else*...

As far as a parallel network, thats a tad overkill.. proper redundant pathways should be enough.. and plenty of packet filtering/shaping/monitoring.

and keep a tighter reign on what is allowed to be attached to the PRODUCTION network..

Parent Share
twitter facebook
QoS and network boundaries (Score:5, Informative)

by pangur ( 95072 ) writes: on Wednesday November 27, 2002 @11:18AM (#4767304)

There are several non-exclusive answers to the Beth Israel problem:
1) introduction of routed domains to seperate groups of switches
2) insure that more than one redundant switching loop does not terminate in a switch. I've had a single switch be the lynch-pin between two loops, had the switch go down and back up, and spanning-tree would not converge. If you want redundancy in your switches, spread out the loops.
3) Put QoS on the network. Identify mission-critical traffic and give it priority and guarenteed bandwidth (Cisco uses LLQ and CBWFQ using DiffServ, CoS, and IP precendence). That way even if someone puts loads of traffic on mission critical paths, the effect should be limited to the local switch port or router, depending how it is implemented.
4) lastly try a redundant network. You would still want QoS to stop a jabbering NIC from hosing your local bandwidth, and you might want to run diagnostics with your pocket PC or laptop, so you would still need to plug into that isolated net anyway. I would recommend that last due to cost, space, and connectivity issues.
Thank you.

Parent Share
twitter facebook
It's HIPAA (Score:3, Informative)

by mrneutron ( 61365 ) writes: on Wednesday November 27, 2002 @11:19AM (#4767312)

Health Insurance Portability and Accountability Act.

Most health care organizations are far from clueless, believe me. Your average healthcare IT manager is well aware of HIPAA, as they've been working on the transaction and privacy aspects for quite awhile.

The techs in the trenches may know less, mostly because the data security regulations (the 3rd, and largest portion of the HIPAA work) are not yet finalized. The real work doesn't begin until then: probably sometime later this year.

Parent Share
twitter facebook
Re:Cisco implemenatation of Spanning Tree sucks (Score:4, Informative)

by netwiz ( 33291 ) writes: on Wednesday November 27, 2002 @11:20AM (#4767325) Homepage

Cisco only runs per-VLAN spanning tree if you're using ISL as your trunking protocol. The reason you don't get it on Extreme Networks stuff is because they use 802.1q. In fact, Cisco devices trunking w/ the IEEE protocol run single instances, just like the Extreme product.

There are tradeoffs, of course. STP recalculations (when running) can be kind of intensive, and if you've got to run them for each of your 200 VLANs, it can take a while. However, for my particular environment, per-VLAN STP is a better solution.

Parent Share
twitter facebook
Re:I don't buy it (Score:5, Informative)

by DaveV1.0 ( 203135 ) writes: on Wednesday November 27, 2002 @11:27AM (#4767390) Journal

Actually, if you read the article carefully, they say that the application the research was running was the straw that broke the camel's back.
"The crisis had nothing to do with the particular software the researcher was using."
"The large volume of data the researcher was uploading happened to be the last drop that made the network overflow. "
While it is never said directly, the implication is that the network was a in bad shape to begin with, and when this guy started doing whatever he was doing, it just pushed things over the edge.

Parent Share
twitter facebook
Re:Sure it was STP? (Score:4, Informative)

by jefftp ( 35835 ) writes: on Wednesday November 27, 2002 @11:33AM (#4767439)

The most common reason spanning tree problems occur is because no one tells the spanning tree domain who the root of the network is. This leads to the switches deciding to gets to be the root. In most implimentations of spanning tree, the lowest MAC address wins.

Because Cisco switches come with Spanning-Tree enabled by default, and because most network "engineers" don't know what spanning tree is, many corporate networks have a random switch serving as the root of the spanning tree. And so when spanning tree tries to do it's job: fail-over to a redundant link, it doesn't do a very good job because the humans who set up the network were either lazy or ignorant.

Laziness and ignorance are the villians of most network problems.

Now if Cisco implimented the follow up to spanning tree: rapid spanning tree protocol (802.1w) like the rest of the industry, you'd eliminate the problem of impatient network admins trying to "tune" their network convergence times. Sadly, at most, you're going to shave 8 seconds off the 30 to 50 seconds of convergence time of STP unless you have a very small network. So tuning STP timers is an excersize in navel-meditation. RSTP (802.1w) solves alot of the convergence time problems with original STP (802.1d) and is nicely backwards compatible.

Parent Share
twitter facebook
Redundant Networks for Patient Care (Score:2, Informative)

by jcm ( 4767 ) writes: on Wednesday November 27, 2002 @11:35AM (#4767468) Homepage

I spent three years (1995-1998) at Perot Systems as a consultant designing and implementing hospital networks for Tenet Healthcare (2nd largest hospital chain in the US). There was at least one hospital that had the budget and the foresight to see that reliance on the network would do nothing but increase.

For that hospital, my network design was one that incorporated as much redundancy as possible at the time. For each patient care area, such as nurse's stations and ancillary areas such as radiology, cardiology, surgical theaters, etc. it was decided that each of the two network jacks would terminate in seperate closets. This meant doubling the number of closets required in order to meet distance limitations, but the hospital had already started working on allocating that space for the closets. Also for any important ancillary areas such as the lab, central supply, there also was two seperate networks. For the server farms theirselves, the Patient Care systems all had redundant connections to the primary and backup networks as well.

As each wall jack terminated into a different closet, each closet had two seperate networks as well. Each closet would house the primary network for half of the jacks served, and the backup network for the other half of the jacks served. The fiber paths from each closet took disparate paths back to seperate data center rooms, one external to the main building of the campus and one inside the main building. At the time layer 3 switches, or switch routers such as the Foundry Big Irons, or Cisco 6500s were not available. So as much as I dislike using Spanning Tree, I had used it at the time. All priorities were manually set though so there was no doubt where the root was and where it would move to in case of failure.

So, the switches terminated on another switch which was partitioned to several segments. Switch connections were made between the two data center as well. Each segment had a connection to a Cisco 7507 Fast Ethernet port local to that computer room, and another in the second computer room. Forming the core were two sets of two Cisco 7507s. In order to prevent one OSPF network from affecting the other OSPF network static routes were used (would use BGP if I had to do it over again). Outside WAN connections were terminated redundantly on the two patient care networks as well.

While the primary network in the hospital also supported the non-patient care areas (such as administration, the backup network was only for the patient care areas. That was just to prevent the type of thing that happened in the article, where something non-patient care related ends up taking everything down.

Reverting to backup paper systems would be nearly impossible once the "tube" systems were sealed up. Much like the movie Brazil, hospitals used to have pneumatic tubes running all over the place, especially between the lab and the nurse stations. Running samples and results back and forth would definately introduce a LOT of delay for a doctor trying to make a life and death decision.

I am sure that I would I design things different these days (for one, Layer 3 would go all the way to every single edge switch and collapse on a fast switch router) but I think the design probably held together well. I should check back in someday and see how long and well it lasted, if they did replace it.

Jay

Share
twitter facebook
Re:Reliability is inverse to the number of compone (Score:4, Informative)

by gorf ( 182301 ) writes: on Wednesday November 27, 2002 @11:53AM (#4767623)

No.

You can only multiply them together like you have done if the two variables are independent.

Here this is clearly not the case; if the networks are identical and one fails, it is more likely that the second will fail because the cause might be identical.

Parent Share
twitter facebook
Its been coming for a log time (Score:5, Informative)

by bolix ( 201977 ) writes: <bolix@hotmaSLACKWAREil.com minus distro> on Wednesday November 27, 2002 @12:05PM (#4767709) Homepage Journal

I've consulted here. No not on the network design! Desktop staff - big hello to the much expanded Research Support team!

AFAIK the BI network has gradually evolved from the 60/70s and has including several massive growth spurts to incorporate the expansions, refits, windfalls etc. I once participated in an after hour Cisco cutover where we yanked connections and waited for the data to flow (IPX round/robin servers listing) to find the specific segments affected. Very much a live trial and error process.

I got the feeling no-one is completely certain where/how all the data flows especially in the older Research segments e.g. Dana Farber. In fact, I'm guessing this is where the failure originated. Heavy duty number crunching and spanning tree errors lead me to some sort of distributed unix process across network segments. I want to blame a certain notorious geek (Dr P's) unix and mac labs but in truth it could be any one of the overworked and underfunded labrats in any of the segments.

The wiring closets used to look way worse than any posted at the recent Register article. A single Cat 5 cable run to a data jack is sometimes split to host 2 connections: unfortunately as the Research areas are grant funded, this is still bloody cheaper than a hub/switch! There is probably still some localtalk cabling in some labs, coax runs to a DG and Novell serial connections with 1 or 2 Mac Classic and SE holdouts running Dos and DG terminal emulators!!!

The network team in the Hospital (2 afaik) coped with daily routing failures, buggy failovers, the crappy Novell IPX 802.3 implementation and servers around every corner. Those folks team with a great desktop staff to nursemaid outdated equipment into the 21st century. It stuns me to this day what a superior job these folks did and probably do. They certainly made my job easier.

I feel this could have happened any time and disaster has been averted one too many times before. Halamka and the exec staff owe these guys more that just a few column inches of chagrined praise.

Share
twitter facebook
Re:Contribution to causality responsibility (Score:5, Informative)

by timeOday ( 582209 ) writes: on Wednesday November 27, 2002 @12:09PM (#4767727)

I agree, and let me refer you to a real life example [info-sec.com]. The USS Yorktown is that very famous Navy ship that was immobilized by a network outage. The whole thing was trigged by some seaman entering a 0 where he shouldn't have, so the Navy made some attempt to pin it on him. But it didn't fly. Operational errors like that are routine. It shouldn't have crashed the app. Having crashed the app, it shouldn't have taken down the whole network.
If one resercher sitting at his desk can take down the whole hospital system accidentally just by "overusing" the network, it's just a matter of time.

Parent Share
twitter facebook
Re: Thick Coax links (Score:2, Informative)

by Ashurbanipal ( 578639 ) writes: on Wednesday November 27, 2002 @12:20PM (#4767826)

Etherhose (10b5 thick coax) is a useable networking technology. It has very good resistance to RFI/EMF. Lots of hospitals still run it, on links where 10 Mb/sec is sufficient.

Etherhose is no longer a good investment because it is labor-intensive to work with (vampire taps, and thick, heavy cabling) and because nobody is developing the technology any more.

Today, fiber optics might seem a better choice for noise isolation, since the cost has come down to a reasonable level.

However, glass has the same potential for future obsolescence as etherhose - I have a half-dozen mutually incompatible fiber links here. And termination, splicing, and interconnection of fiber is at least as difficult as working with etherhose... having done both, I'd say drilling for a vampire tap is easier.

In short, don't replace a working piece of infrastructure needlessly (wait until you project a need for additional bandwidth) and for noise isolation cat 5e in a grounded metal conduit is probably your best bet. Large diameter, professional quality conduit runs through electrically noisy areas are costly but also a very safe investment.

I wouldn't knock that old etherhose - it does its job quite well, far better than the 10b2 thin coax that replaced it ever did. And it's far more physically sturdy than anything else outside of conduit.

Parent Share
twitter facebook
Offtopic (Score:2, Informative)

by InadequateCamel ( 515839 ) writes: on Wednesday November 27, 2002 @12:40PM (#4768003)

I read in a book about the number zero that I mentioned here before that the real cause was someone accidentally left a zero in a line of code, rather than a person pressing zero and crashing the entire network. Perhaps someone tried to execute a command that led to this faulty code being used by the ship's computers?
Maybe this was proven to be false later, I dunno.
Kind of funny though...

Parent Share
twitter facebook
Re:Networks are fragile. (Score:2, Informative)

by Mr. KaryHead ( 111035 ) writes: on Wednesday November 27, 2002 @01:02PM (#4768191) Homepage

Networks can be fragile and spanning tree can certainly cause some of the problems. That is why one must design the spanning tree topology. When you say "one switch declares itself a server of a given protocol", I assume you mean "declares itself the root of a VLAN." The root is determined by the lowest advertised bridge ID from each switch. The bridge ID is the bridge priority plus the bridge address. Cisco switches have a default bridge priority. So then it boils down to whichever switch has the lowest bridge address becomes the root, which could be any switch anywhere in your network. The network admin should decide which switch will be the root for a given VLAN and set the bridge priority lower. And then he/she selects another switch to be a backup root and sets its priority to be lower than the default but higher than the root's priority. So you if don't manually set the root then a new switch plugged into the network could very well become the root if all the switches have a default priority and the new switch has a lower bridge address than the current root.

If this happens, you can just turn off the offender to get your root back. In STP only the root talks. If the other switches don't hear from the root in something like 20 seconds, then they'll elect a new root.

-Kary

Parent Share
twitter facebook
Re:Spanning tree (Score:3, Informative)

by jroysdon ( 201893 ) writes: on Wednesday November 27, 2002 @01:18PM (#4768350)

Disabling spanning tree on a network of any size is suicide waiting to happen. Without spanning tree you'll be instantly paralyzed by any layer two loops.

For instance: Bonehead user wants to connect 2-3 more PCs at his desk, so he brings in a cheap hub or switch. Say it doesn't work for whatever reason, so he leaves the cable in and connects a second port from the wall (or say later on it stops working so he connects a second port to test). When both of those ports go active and you don't have spanning tree, you've just created a nice loop for that little hub or switch to melt your network. Just be glad it's going to be a cheap piece of hardware and not a large switch, or you'd never be able to even get into your production switches using a console connection until you find the connection and disable it (ask my how I know). How long does this take to occur? Not even a second.

Spanning tree is your friend. If you're a network technician/engineer, learn how to use it. Learn how to use root guard to protect your infrustructure from rouge switches (or even evil end-users running "tools"). A simple search on "root guard" at Cisco.com returns plenty of useful hits [cisco.com]

At my present employer, we're actually overly strict and limit each port to a single MAC address and know what every MAC address in any company hardware is. We know where every port on our switches go to patch panels. If anything "extra" is connected, or a PC is moved, we're paged. If a printer is even disconnected, we're paged. The end-users know this, and they know to contact IT before trying to move anything.

Why do we do this? We've had users bring in wireless access points and hide them under their desks/cubes. We want to know instantly if someone is breaching security or opening us up to such a thing. Before wireless, I'd say this was overly anal, but now, it's pretty much a requirement. The added benefit to knowing if an end-user brings a personal PC from home, etc., on to the network (which means they possibly don't have updated MS-IE, virus scanners/patterns, may have "hacking tools", etc.). This isn't feasible on a student network or many other rapidly changing networks, but on a stable production network it's a very good idea. Overhead seems high at first, but it's the same as having to go patch a port to a switch for a new user - you just document the MAC address and able port-level security on the switch port:

interface FastEthernet0/1 port security action trap port sec max-mac-count

With Syslogging enabled, you'll know when this occurs and if you've got expect scripts to monitor and page you when another mac address is used on that port, and if you've got your network well documented, you can stop by the end-user while they're still trying to dink around hooking up their laptop and catch 'em in the act.

Yes, I know all about MAC address spoofing. Do my end-users? Probably not, and by the time they find out, they're on my "watch list" and their manager knows. Of course, that's where internal IDS is needed and things start to get much more complicated, but at least you're not getting flooded with odd-ball IDS reports if you manage your desktops tight so users can't install any ol' app they want. Higher upfront maintenance cost? Perhaps, but we've never had any end-user caused network issue.

I'm fairly certain that if someone was running a "bad" application like what hosed the network in this story, I'd find it in under 30 minutes with our current network documentation. Would it require a lot of foot traffic? Yes, as the network would possible be hosed so management protocols wouldn't work, but I could isolate it fairly fast with console connections and manually pulling uplink ports.

Parent Share
twitter facebook
WRONG!: Re:Problem was with an application, (Score:5, Informative)

by fanatic ( 86657 ) writes: on Wednesday November 27, 2002 @01:51PM (#4768662)

No application can cause a spanning tree loop. It is simply impossible.

A spanning tree loop causes broadcast frames - correectly used in small numbers in many different circumstances - to loop endlessly about the network (clogging it up), using paths that are provided for redunancy, but which are normally stopped form passing traffic by the "spanning tree protocol".

There are 2 likely causes:

Unidirectional link failure. If a connection between switches passes traffic in only one direction (normally they are bi-directional), then spanning tree can be 'fooled' into allowing traffic on a path that creates a loop and lets frames loop endlessly.

Misconfiguration of switches, possibly combined with erroneous cabling. If spanning tree is configured off on a port, (or, maybe, put into a mode called portfast), it's possible for interconnection of switch ports (through a crossover cable or other means) to cause this to occur.

A third possible cause is that the spanning tree software itself screws up and allows a loop when it shouldn't have. This was known to occasionally happen in Cisco switches some years ago. I haven't heard of it lately.

This all happens way below the application layer. Unless the application is speccific written to send huge numbers of broadcast frames (there is no legitimate reason for an app to do this), it couldn't bring down the network. And even if it did, this would not be a 'spanning tree loop' and disconnecting the offending station woul immediately fix the problem.

Probably, the network should be using routers to partition it into smaller LANs. But ths can stilll happen to any single LAN so creaeted and if it happens to the one your servers are on, you're still cooked.

Parent Share
twitter facebook
Re:WRONG!: Re:Problem was with an application, (Score:1, Informative)

by Anonymous Coward writes: on Wednesday November 27, 2002 @02:51PM (#4769185)

Actually, broadcasts are not the only type of traffic that is flooded by a bridge. Multicasts in general are flooded, as well as unicasts for which the destination MAC address has not been learned.

Building a separate infrastructure for "mission-critical" apps might be tough...is this only life-critical, or would that apply to the administrative functions, too? Besides the problem of deciding which functions the network should support, you have the problem that it is easy for someone to accidentally connect both networks together (i.e., if there is a person who has systems on both networks, and is re-wiring their cubicle and inadvertantly connects the two networks to a common switch.

Any large infrastructure like this should be subdivided at layer 3, on at least a building-by-building level, and perhaps floor-by-floor. If a subnet is larger than 2000 nodes, the likelihood of trouble rises quickly.

Another issue with Spanning Tree is that if you have a new bridge plugged in to the network that manages to convince the other bridges that it is the root (through a poor selection of default values on the part of the vendor, or a pre-existing config that isn't applicable to this network, or a mis-configuration by the end-user), then it will be in the forwarding path of *all* the flooding-based traffic (see the list in my first paragraph above). In such a scenario, broadcast-based discovery protocols like ARP will probably fail since this switch won't be even seeing certain traffic since it won't even make it onto the clogged links running upstream toward the root, many network applications will fail. And if ARP ain't happy, ain't nobody happy.

Parent Share
twitter facebook
Re:WRONG!: Re:Problem was with an application, (Score:4, Informative)

by khafre ( 140356 ) writes: on Wednesday November 27, 2002 @02:56PM (#4769241)

Actually, it is possible for an application to cause Spanning Tree to fail. Most switches have a management port that allow remote access (via telnet, ssh, SNMP, etc.) to the switch. This management port is normally connected to its own VLAN isolated behind a router so user brodcasts & multicasts in another VLAN can't affect the switch CPU. This port can be overrun with brodcasts and multicasts from user applications providing both the user and the switch are on the same VLAN. If this CPU is consumed by processing broadcasts, it may not have enough CPU time available to process and forward spanning tree BPDUs. If a blocked port becomes opened, a switch loop could form and, BINGO, network meltdown.

Parent Share
twitter facebook
Re:Problem was with an application, (Score:2, Informative)

by GarryOwen ( 190545 ) writes: on Wednesday November 27, 2002 @03:13PM (#4769386)

You sound a bit old school, routing now days can be as fast as a switch, course routers that fast will cost a hell of alot more. The reason why is most routers nowdays don't actually do a per packet inspection and routing. They route the first packet of stream and then switch all following packets in that stream. Also, if your statement the lower on the 7 layer model you are the faster you go is wrong, otherwise hubs would be faster than switches(layer 1 vs layer 2).

Parent Share
twitter facebook
Re:Problem was with an application, (Score:5, Informative)

by aheath ( 628369 ) writes: <adam,heath&comcast,net> on Wednesday November 27, 2002 @04:45PM (#4770093)

I contacted Dr. John D. Halamka to see if he could provide more detail on the network outage. Dr. Halamka is the chief information officer for CareGroup Health System, the parent company of the Beth Israel Deaconess medical center. His reply is as follows: "Here's the technical explanation for you. When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with 802.1d standards. The management vlan (vlan 1) had in some locations 10 Layer2 hops from root. The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that two distinct bridges in the network should not be more than seven hops away from one to the other. Part of this restriction is coming from the age field Bridge Protocol Data Unit (BPDU) carry: when a BPDU is propagated from the root bridge towards the leaves of the tree, the age field is incremented each time it goes though a bridge. Eventually, when the age field of a BPDU goes beyond max age, it is discarded. Typically, this will occur if the root is too far away from some bridges of the network. This issue will impact convergence of the spanning tree. A major contributor to this STP issue was the PACS network and its connection to the CareGroup network. To eliminate its influence on the Care Group network we isolated it with a Layer 3 boundary. All redundancy in the network was removed to ensure no STP loops were possible. Full connectivity was restored to remote devices and networks that were disconnected in troubleshooting efforts prior to TACs involvement. Redundancy was returned between the core campus devices. Spanning Tree was stabilized and localized issues were pursued. Thanks for your support. CIO Magazine will devote the February issue to this event and Harvard Business School is doing a case study."

Parent Share
twitter facebook
Re:Problem was with an application, (Score:3, Informative)

by pyite ( 140350 ) writes: on Wednesday November 27, 2002 @06:19PM (#4770955)

Technically, hubs are faster than switches for N endpoints when N = 2. The reason is hubs do not have to look at the frame being sent and either store-and-forward or cut-through like a switch does. Your total possible collision locations on a hub is N * (N - 1) / 2 (Gauss' formula for sum of 1 to N, coincidentally), where once again N is the number of endpoints. In a switch, your collision domain always has two endpoints, therefore your total possible collisions is 1, thus you get increased speed.

Parent Share
twitter facebook
Re:WRONG!: Re:Problem was with an application, (Score:4, Informative)

by Anonymous Coward writes: on Wednesday November 27, 2002 @06:56PM (#4771207)

Third possiblity - and what I'd be confident is the initial cause.

The amount of traffic the researcher was putting onto the network caused spanning tree hello BPDUs to be dropped.

After a period of not receiving hello messages (20 seconds if memory serves), downstream devices believe the upstream device has failed, and decide to re-converge the spanning tree.

During this re-convergence, the network can become partitioned. It is preferable to partition the network to prevent loops in the layer 2 infrastructure. Datalink layer frames eg ethernet, don't have a hop count, so they will loop endlessly - potentially causing further failures of the spanning tree protocol.

Once the bulk traffic source is removed from the network, STP should stabilise within a fairly short period - 5 minutes or so - so there may also have been a bug in Cisco's IOS, which was triggered by this STP event.

Altneratively, the network admins may have played with traffic priorities, causing this researcher's traffic to have a higher priority over STP messages, causing the STP to fail.

Radia Perlman has a good description of STP in her book "Interconnections, 2nd ed" - but then she should - she invented it.

Parent Share
twitter facebook

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Related Links Top of the: day, week, month.

503 commentsHarvard, MIT and UPenn's Presidents Should 'Resign in Disgrace', Bill Ackman Says
453 commentsEra of Global Boiling Has Arrived, UN Chief Says
417 commentsOceanGate Says All Five Titan Passengers Have Died
414 commentsJudge Blocks US Officials From Tech Contacts in First Amendment Case
404 commentsIs Gen Z Giving Up on College?

The one day you'd sell your soul for something, souls are a glut.