Hospital Brought Down by Networking Glitch
hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled to move information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the proposed long-term solution is apparently to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
Problem was with an application, (Score:5, Insightful)
Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.
Of course it can help (Score:2, Insightful)
But will anyone know when one network fails? If not, then how will they fix it? If they don't fix it, then doesn't that mean that they really only have one network?
Which puts them right back to where they were.
Of course, if they put in a redundant network, then fix their problems to try to prevent this issue from happening in the future, they'll be in much better shape the next time their network gets flushed with the medical waste.
Leading question (Score:4, Insightful)
Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?
A second (unreliable) network? (Score:4, Insightful)
"Why buy one when you can buy two at twice the price?"
Um.. (Score:4, Insightful)
Even the best networks will come unglued sooner or later. It's surprising to see how many businesses' networks need prime operating conditions to function properly.
Re:Problem was with an application, (Score:5, Insightful)
2nd network (Score:4, Insightful)
Politics (Score:1, Insightful)
Doesn't really matter. If you had to deal with Med Students as we do, you'd die before you went to the doctor. Trust me.
Re:That's why I hate automatic routing (Score:3, Insightful)
How do you handle mobile users? What about dialup static IP addresses from multiple RAS devices?
Hand-editing of routing tables works only in the most simple of networks.
Of course they need another network (Score:5, Insightful)
Re:Problem was with an application, (Score:4, Insightful)
Re:Spanning tree (Score:3, Insightful)
On a network as complex and messy as theirs? That's exactly the situation where you need spanning tree; otherwise it just crumbles to dust the first time someone does produce a loop...
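For readers who haven't met it: STP's whole job is to reduce a looped switch topology to a loop-free tree by blocking the redundant links. A toy illustration in Python, using a plain breadth-first search as a stand-in; real STP elects a root bridge by priority and MAC address and exchanges BPDUs to do the same thing distributedly:

```python
# Given a switch topology with loops, compute a spanning tree and
# report which links must be blocked so broadcast frames can't
# circulate forever. (Illustrative only; real STP is a distributed
# protocol, not a centralized graph computation.)
from collections import deque

def blocked_links(links, root):
    """Return the links a spanning tree rooted at `root` leaves out."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, tree = {root}, set()
    queue = deque([root])
    while queue:
        sw = queue.popleft()
        for nb in sorted(adj.get(sw, ())):
            if nb not in seen:
                seen.add(nb)
                tree.add(frozenset((sw, nb)))
                queue.append(nb)
    return sorted(tuple(sorted(l)) for l in
                  ({frozenset(l) for l in links} - tree))

if __name__ == "__main__":
    # Three switches wired in a triangle: one link must be blocked.
    print(blocked_links([("A", "B"), ("B", "C"), ("A", "C")], root="A"))
```

The blocked links aren't wasted; they sit idle until a tree link dies, at which point a recalculation brings one of them into service.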
Re:Reliability is inverse to the number of compone (Score:4, Insightful)
If each network is down 10% of the time and they fail independently, both are down only 0.1 * 0.1 = 1% of the time.
So as long as the system can run on just one of the two, you get a ten-fold increase in availability.
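The parent's arithmetic holds only if the two networks fail independently (correlated failures, like a shared power feed or identical firmware bugs, break the assumption); a quick sketch:

```python
# Two networks, each down 10% of the time, failing independently,
# are both down only 0.1 * 0.1 = 1% of the time.

def both_down(p_a, p_b):
    """Probability that two independently failing networks are down at once."""
    return p_a * p_b

print(round(both_down(0.1, 0.1), 4))  # 0.01, i.e. 1%
```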
Re:Spanning tree (Score:3, Insightful)
Are you talking about a different spanning tree protocol than I think you're talking about? Spanning tree is a very good thing to run to stop loops exactly like this. More than likely one of the hospital network techs misconfigured something and ended up disabling it (portfast on two access points accidentally linked into another switch, or a rogue switch?).
Are you crazy? (Score:2, Insightful)
No, disabling STP is NOT an option. Learning how to use STP properly is the option.
Re:No. (Score:5, Insightful)
So the answer is - Yes. In a situation where 100% uptime is demanded, the only solution is redundant systems.
The real problem (Score:4, Insightful)
So what are the lessons?
1) Make sure your solution scales, and be ready in case it doesn't.
2) Make sure some overall organization can control how networks get connected.
I don't buy it (Score:5, Insightful)
People doing debugging tend to fasten onto an early hypothesis and work with it until it is proven definitively false. Even when jobs aren't on the line, people often hold onto their first explanation too hard. When jobs are on the line, nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.
The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation, that suggests bad design to me. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me that the design was faulty.
One thing I would agree with you on is that the hospital probably needs a separate network for life-critical information.
Fix it the first way that works. (Score:3, Insightful)
It may not be the right way to do it, but they're running a hospital, and might not have the time to let their network people puzzle it out.
Re:the sad part (Score:3, Insightful)
sneaker-based, when everyone must run around passing paper;
warehouse-based, when rows upon rows of storage are now required to keep all those bits of paper;
administrative-overhead-based, when you realize that it takes two minimum-wage file clerks per desk - not per functional area - to file and find each form, and that takes a LOT of time;
and Mexican-based (yes, I said Mexican - who do you think most major businesses pay to do this? I know for a fact they ship things like this there by the truckload) when you need cheap data entry and "error checking" [which is very unreliable when they can't read your language!] to enter information that couldn't be read from the handwriting and then index it under a reasonable filing code.
Having spent a considerable amount of time as an admin assistant myself, and later as a document imaging and workflow support person, I can tell you that the cost and manpower savings far outweigh any consideration of robustness or reliability.
The PHBs - or very likely the 'managed care' people (and that should have been put in quotes too) who provide a lot of the funding for the hospitals - likely decided to save a few thousand, since it wasn't lifesaving equipment or blood products/pharmaceuticals/etc.
Re:Short answer? No. (Score:2, Insightful)
CCNP/CCIEs not what they are cracked up to be? (Score:1, Insightful)
Hrmm, it says that many Cisco engineers rushed in to "save the day" and didn't get it fixed. I have seen this before. Perhaps those Cisco CCNPs/CCIEs are not really that good... Then again, as someone else pointed out, if the current network engineer at the hospital didn't have the common sense to revert any changes that were made, or to figure out a (relatively) simple spanning tree problem, he should be the first to go. Sheesh, people need to know the fundamentals of networking and protocols before they are made heads of very large networks.
Been there done that, got the ass beating (Score:3, Insightful)
I know it's hard for everyone to believe, but vendors lie and those whiz bang network tools can screw you over.
We have several thousand users on our campus with several thousand computers, running on about a half-dozen Cisco 6500-series switches. Spanning tree recalculations take about a second or two. That's no big deal, and traffic is re-routed nicely when something goes wrong. But if an interface that uplinks to the other switches is flapping up and down, the whole network will grind to a halt with spanning tree.
Test Network GOOD (if you have the money).
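The flapping-uplink failure mode described above is why vendors ship hold-down or "flap dampening" features: a link that changes state too often gets suppressed instead of triggering recalculation after recalculation. A toy version in Python; the window and limit values are made up for illustration:

```python
# Suppress a link that has bounced up/down too many times inside a
# sliding time window, rather than letting every transition kick off
# another spanning-tree recalculation.

def should_suppress(transitions, now, window=60.0, limit=5):
    """transitions: timestamps of recent up/down events for one link."""
    recent = [t for t in transitions if now - t <= window]
    return len(recent) > limit

if __name__ == "__main__":
    flappy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # six flaps in a few seconds
    print(should_suppress(flappy, now=10.0))  # True: dampen the port
```

Real implementations also decay the penalty over time so a link that settles down is eventually readmitted automatically.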
The Solution (Score:5, Insightful)
The solution is to put edge routers in every building (Cisco 6509s with MSFC cards). Segment each building into a different IP network and route between the networks. That way you may lose a building if spanning tree goes futzed, but you won't lose the whole campus.
Sure you'll be a touch slower routing between the segments but you'll have much more reliability.
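The design above in concrete terms: carve one campus block into per-building subnets, so a spanning-tree meltdown stays inside a single building's layer-2 domain and the routers contain the blast radius. A sketch using Python's `ipaddress` module; the addresses and building names are invented, not the hospital's:

```python
# Carve a campus /16 into one /24 per building; routers (e.g. the
# MSFCs in each closet) route between them, so broadcast storms and
# STP loops can't cross building boundaries.
import ipaddress

def building_subnets(campus_cidr, buildings, new_prefix=24):
    """Assign one subnet of `new_prefix` length per building, in order."""
    campus = ipaddress.ip_network(campus_cidr)
    return {name: str(subnet)
            for name, subnet in zip(buildings, campus.subnets(new_prefix=new_prefix))}

if __name__ == "__main__":
    plan = building_subnets("10.0.0.0/16", ["East", "West", "Research", "Clinics"])
    for name, subnet in plan.items():
        print(f"{name}: {subnet}")
```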
Add a second network? Not likely to help (Score:5, Insightful)
Of course not. Two solutions are more obvious:
This might also be a good reminder to get very aggressive "liquidated damages" clauses in contracts like this, or to buy insurance. If a patient dies because of the network outage, I am sure that everyone in the supply chain will be named in the lawsuit.
The liquidated damage clause is intended to provide an unambiguous motivation for the technology provider to fix the problem quickly, while the insurance would cover all or a portion of the losses if there is a failure.
I would be extremely surprised if a huge campus like this one did not have a substantial number of different technologies in use, including wireless, and clearly networking them all into the same patient-records database is a challenge.
Redundancy and death (Score:2, Insightful)
A lot of people here have said "build a 2nd network," to which some have basically said, "that's stupid, make your first network run right." I think that if we're talking about the life and death of patients, a second network would be a good idea. It's like the high factors of safety built into things like, say, an elevator: a failure can cause death, so you overbuild it. Remember that you don't have to make everything redundant, just the critical parts of the system. Maybe the administrators can only use the primary network, but the blood-testing labs and nurses' stations and such can use either the primary or the secondary. Cutting off non-critical traffic during an outage also helps keep the whole system more stable.
Life threatening? (Score:3, Insightful)
The network crash in question screwed up the document process, slowed everything down, and made life inconvenient, but I doubt anyone's life was at risk.
Re:Hospital Systems (Score:5, Insightful)
To be fair, they have gotten much better...
You seem to have forgotten to explain why they were worse.
If they are running thick ethernet and VAX machines, it is probably because nobody has looked at the system recently, presumably because it hasn't failed. This is how things should be.
What terrifies me is that places like hospitals (where things really need to keep working) run systems which have only been around for a few years, and in that time proved themselves to be extremely unreliable, in general.
New features should not be added at the cost of stability, and this is what people seem to be doing all the time. People are perfectly capable of carrying on using paper, and should be trained and have a procedure to do so at a moment's notice. If the job is so complex that paper is simply not an option (this seems unlikely; even air traffic controllers can manage without computers), then computers should have a ridiculous amount of redundancy built in to them, something I've only heard of NASA even approaching.
Re:Been there done that, got the ass beating (Score:1, Insightful)
I work at a teaching hospital... (Score:5, Insightful)
Maybe not so ridiculous (Score:2, Insightful)
Perhaps the seemingly ridiculous "secondary" parallel network can be put in place not for redundancy, but as a tool to migrate the existing devices to a properly configured and routed network. If STP brought the whole thing down to begin with, they are probably flat. VLANs and subnetting at the closets, with appropriate L1 redundancy and L3 routing, is most likely the modern network design their IT staff has known for years they should have, but they never had the convincing argument they needed to get management to foot the bill and allow the service disruptions required to make the switch.
Re:No. (Score:5, Insightful)
ostiguy
Re:Life threatening? (Score:5, Insightful)
Re:Spanning tree (Score:4, Insightful)
Contribution to causality responsibility (Score:5, Insightful)
Would it be fair to say that the bridge collapsed because a 300 lb man was on it? It is completely clear that he contributed to the collapse of the bridge, in the sense that he contributed to the stresses on the structure. One might even say he is more responsible than a 100 lb woman who was also on the structure at the time.
But, we'd generally expect that a footbridge be engineered to support a 300lb man. Or if not, to isolate the failure (e.g. the planks under him might fall out, but the bridge as a whole should not collapse). It's part of the designer's job to anticipate this.
I've done a lot of troubleshooting in my time, of networks and other systems. One thing I've learned is that in the case of failure you just can't fasten on one thing that is out of the ordinary. At any given time, in a big enough system, something's bound to be out of the ordinary. Even if you can trace, step by step, the propagation of a problem from a single anomalous event, it is the capacity of the system to propagate the problem that is the real issue, at least if you take a conservative, defensive stance in design.
Problem was with bad Business Practices... (Score:2, Insightful)
I develop business practices for large industries (including, in the past, the Trans-Alaska pipeline, et al.). These industries rely heavily on computers, and each has developed plans and trained their critical personnel for emergencies like power failures, computer failures, etc. Reliance on a single tool to protect safety & environment is bad, m'kay?
no, identical networks crash in identical ways (Score:2, Insightful)
As for "identical separate network", at my old company, we had a pair of Cisco PIX units that were configured in stateful failover; this means they share enough information that if one keels over, not a single connection is dropped.
Unfortunately, the PIX OS release had a bug that would cause a crash every so often, and guess what?
One would crash, then the second would crash immediately.
As mentioned, the issue here was completely improper network structure, with research and production networks one and the same. Does this mean someone can walk in with a laptop and start spewing data and/or false routing info and crash the entire hospital? The responsible parties should be FIRED, given today's labor market; absolutely inexcusable.
I'd also guess improper change control procedures were involved here as well.
Whoever handles the hospital's emergency preparedness should also be fired for not keeping staff familiar with alternative methods (gasp, PAPER!). What if they had a power failure? That happens all the time, and not always because of external causes... "keeping the power on" is not as simple as "install a big backup power plant for the place." As Exodus discovered once at their CA datacenter, backup generators don't always work.
Downtime Procedures (Score:5, Insightful)
I work at a hospital, on the networking side of things. It's a fairly large hospital, and we've got some pretty amazing tech here that runs this place. But BY LAW we have downtime procedures. ALL STAFF MUST KNOW THEM. We have practice sessions monthly in which staff uses downtime procedures (pen and paper) to ensure that if our network were completely lost, we could still help patients. It's the friggin law. Whoever fucked up and hadn't looked at downtime procedures in 6 years should be fired. That's just bullshit.
I don't know how that hospital was able to pass inspections.
Why not fix spanning tree? (Score:3, Insightful)
Multiple Problems and Multiple Solutions (Score:2, Insightful)
Full redundancy is impossible - are you really going to have dual NICs in every workstation and expect that everything would just work in the event of a failover? First of all, the expense would be incredible, and the maintenance would be a nightmare. If they are like most institutions, they are already understaffed and overworked - they wouldn't be able to keep something like that together. Dual-home closet switches to redundant routers/switches that are in turn dual-homed to a redundant core. Servers should have multiple NICs that are attached to multiple switches specifically to provide redundancy.
The worst problem here, though, was not the network itself. This is probably the most prevalent common problem to all institutions - they had no test environment. As multiple other posters have pointed out, this experimental database should never have been attached to a production network, regardless of the expected impact it might have. The key word about it is EXPERIMENTAL - you don't know how it might impact anything. As long as there is no separate environment for testing, there's really no such thing as redundancy no matter how the network is configured.
Say, for example, that the application took down the primary network, so the secondary comes up and takes over. Did anyone realize what caused the failover? Probably not, since a properly configured network will failover in a matter of seconds. So, the application is still running. How long until the secondary network fails as well? Then all of the expense and reconfiguration that went into building the redundant network were for nothing.
If this hospital is like most, they have an extremely diverse hodgepodge of equipment - some incredibly old stuff that they keep around because it works, and some really cool cutting-edge gadgets that everyone can see the benefit of. They've also expanded the network as needed and tried not to take anything down when they did it, so what they've ended up with is a logical rat's nest. VLANs probably have been created, but they're probably trunked everywhere, because the goal of the expansion was to connect more devices, not to segregate by function. Hospitals don't get down time, so it's not a simple thing to say that things have to be reconfigured. Odds are the workstations may not all even be on DHCP, so changing an IP may require a person (back to that understaffed thing again) touching possibly hundreds of workstations. Yes, that needs to be done, and I don't know a single network admin who wouldn't agree, but when you have to have outages cleared by upper management, who are going to be chewed out by the board if the time frame runs longer than you expected, it turns into a lot more than just what is actually best for the network.
The solution: use down time wisely. Stage implementations and keep them within the allotted time frames. And DOCUMENT. I know - nobody likes to do the documentation, but I think we can all agree that it's a lot easier to plan migrations if you have documentation of what is currently there. Realize that no matter what you do, it's not going to last forever. Your cable plant probably has a lifespan of 10 years (not to say that you may not get 20 or even 30 years out of it, as long as you're willing to stay slow), but your network devices will probably only be there for 5 years. Are you still going to be there for the next change? Probably not, so be nice to the company and to the people who follow after you, and document.
Just my $0.02, and I'm just that blond chick, so what do I know anyway...
Interesting response (Score:3, Insightful)
They have been open about the problem, in a way that a for profit corporation could never be. This allows the rest of the world to learn from the experience.
A common logical fallacy... (Score:3, Insightful)
It's all about the Benjamins (Score:5, Insightful)
The computer systems at my wife's medical school were apparently run by a herd of poorly trained monkeys. Systems would crash constantly, admin policies were absurd, and very little was done to fix anything. At her current hospital, the residents in her department are stuck with machines that literally crash 10+ times daily. Nothing is done to fix them because that would take expertise, time and $, all of which are either in short supply or withheld.
Hospitals really need serious IT help and it is a very serious problem. This article just illustrates how pathetically bad they do the job right now. I wish I could say I was surprised by this but I'm not.
Union "help" (Score:3, Insightful)
My advice is to get to know any tradespeople you may have to deal with on a regular basis for things like electrical work, moving furniture, etc. It's amazing how far just treating them as fellow skilled professionals will get you. Resorting to bribery (aka "gifts") can also help. If you give the union electrician a bottle of nice scotch or a box of cigars when he adds some new circuits in the server room, he is much more likely to come out at 3am on a Sunday morning when you need him NOW.