Hospital Brought Down by Networking Glitch
hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled to move information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the proposed long-term solution is apparently to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
Problem was with an application, (Score:5, Insightful)
Large campus networks hosting extremely critical live applications may need to be subdivided by more than a switch, yes.
Of course it can help (Score:2, Insightful)
But will anyone know when one network fails? If not, then how will they fix it? If they don't fix it, then doesn't that mean that they really only have one network?
Which puts them right back to where they were.
Of course, if they put in a redundant network, then fix their problems to try to prevent this issue from happening in the future, they'll be in much better shape the next time their network gets flushed with the medical waste.
Leading question (Score:4, Insightful)
Am I the only person getting tired of story submitters using Slashdot to support their personal agendas?
A second (unreliable) network? (Score:4, Insightful)
"Why buy one when you can buy two at twice the price?"
Um.. (Score:4, Insightful)
Even the best networks will come unglued sooner or later. It's surprising to see how many businesses' networks need prime operating conditions to function properly.
Re:Problem was with an application, (Score:5, Insightful)
2nd network (Score:4, Insightful)
Politics (Score:1, Insightful)
Doesn't really matter. If you had to deal with Med Students as we do, you'd die before you went to the doctor. Trust me.
Re:That's why I hate automatic routing (Score:3, Insightful)
How do you handle mobile users? What about dialup static IP addresses from multiple RAS devices?
Hand-editing of routing tables works only in the most simple of networks.
Of course they need another network (Score:5, Insightful)
Re:Problem was with an application, (Score:4, Insightful)
Re:Spanning tree (Score:3, Insightful)
On a network as complex and messy as theirs? That's exactly the situation where you need spanning tree; otherwise it just crumbles to dust the first time someone does produce a loop...
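For readers who haven't met it: STP's whole job is to reduce a looped switch topology to a loop-free tree by blocking the redundant links. A toy illustration in Python, using a plain breadth-first search as a stand-in; real STP elects a root bridge by priority and MAC address and exchanges BPDUs to do the same thing distributedly:

```python
# Given a switch topology with loops, compute a spanning tree and
# report which links must be blocked so broadcast frames can't
# circulate forever. (Illustrative only; real STP is a distributed
# protocol, not a centralized graph computation.)
from collections import deque

def blocked_links(links, root):
    """Return the links a spanning tree rooted at `root` leaves out."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, tree = {root}, set()
    queue = deque([root])
    while queue:
        sw = queue.popleft()
        for nb in sorted(adj.get(sw, ())):
            if nb not in seen:
                seen.add(nb)
                tree.add(frozenset((sw, nb)))
                queue.append(nb)
    return sorted(tuple(sorted(l)) for l in
                  ({frozenset(l) for l in links} - tree))

if __name__ == "__main__":
    # Three switches wired in a triangle: one link must be blocked.
    print(blocked_links([("A", "B"), ("B", "C"), ("A", "C")], root="A"))
```

The blocked links aren't wasted; they sit idle until a tree link dies, at which point a recalculation brings one of them into service.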
Re:Reliability is inverse to the number of compone (Score:4, Insightful)
If each network is down 10% of the time and they fail independently, both are down only 0.1 * 0.1 = 1% of the time.
So as long as the system can run on just one of the two, you get a ten-fold increase in availability.
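The parent's arithmetic holds only if the two networks fail independently (correlated failures, like a shared power feed or identical firmware bugs, break the assumption); a quick sketch:

```python
# Two networks, each down 10% of the time, failing independently,
# are both down only 0.1 * 0.1 = 1% of the time.

def both_down(p_a, p_b):
    """Probability that two independently failing networks are down at once."""
    return p_a * p_b

print(round(both_down(0.1, 0.1), 4))  # 0.01, i.e. 1%
```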
Re:Spanning tree (Score:3, Insightful)
Are you talking about a different spanning tree protocol than I think you're talking about? Spanning tree is a very good thing to run to stop loops exactly like this. More than likely one of the hospital network techs misconfigured something and ended up disabling it (portfast on two access points accidentally linked into another switch, or a rogue switch?).
Are you crazy? (Score:2, Insightful)
No, disabling STP is NOT an option. Learning how to use STP properly is the option.
Re:No. (Score:5, Insightful)
So the answer is - Yes. In a situation where 100% uptime is demanded, the only solution is redundant systems.
The real problem (Score:4, Insightful)
So what are the lessons?
1) Make sure your solution scales, and be ready in case it doesn't.
2) Make sure some overall organization can control how networks get connected.
I don't buy it (Score:5, Insightful)
People doing debugging tend to fasten onto an early hypothesis and work with it until it is proven definitively false. Even when jobs aren't on the line, people often hold onto their first explanation too hard. When jobs are on the line, nobody wants to say the assumptions they were working under for days were wrong, and some people will start looking for scapegoats.
The idea that one researcher was able to bring the network down doesn't pass the sniff test. If this researcher was able to swamp the entire campus network from a single workstation, that suggests bad design to me. The fact that the network did not recover on its own and could not be recovered quickly by direct intervention pretty much proves to me that the design was faulty.
One thing I would agree with you on is that the hospital probably needs a separate network for life-critical information.
Fix it the first way that works. (Score:3, Insightful)
It may not be the right way to do it, but they're running a hospital, and might not have the time to let their network people puzzle it out.
Re:the sad part (Score:3, Insightful)
sneaker-based, when everyone must run around passing paper;
warehouse-based, when rows upon rows of storage are now required to keep all those bits of paper;
administrative-overhead-based, when you realize that it takes two minimum-wage file clerks per desk - not per functional area - to file and find each form, and that takes a LOT of time;
and Mexican-based (yes, I said Mexican - who do you think most major businesses pay to do this? I know for a fact they ship things like this there by the truckload) when you need cheap data entry and "error checking" [which is very unreliable when they can't read your language!] to enter information that couldn't be read from the handwriting and then index it under a reasonable filing code.
Having spent a considerable amount of time as an admin assistant myself, and later as a document imaging and workflow support person, I can tell you that the cost and manpower savings far outweigh any consideration of robustness or reliability.
The PHBs - or very likely the 'managed care' people (and that should have been put in quotes too) who provide a lot of the funding for the hospitals - likely decided to save a few thousand, since it wasn't lifesaving equipment or blood products/pharmaceuticals/etc.
Re:Short answer? No. (Score:2, Insightful)
CCNP/CCIEs not what they are cracked up to be? (Score:1, Insightful)
Hrmm, it says that many Cisco engineers rushed in to "save the day" and didn't get it fixed. I have seen this before. Perhaps those Cisco CCNPs/CCIEs are not really that good... Then again, as someone else pointed out, if the current network engineer at the hospital didn't have the common sense to revert any changes that were made, or to figure out a (relatively) simple spanning tree problem, he should be the first to go. Sheesh, people need to know the fundamentals of networking and protocols before they are made heads of very large networks.
Been there done that, got the ass beating (Score:3, Insightful)
I know it's hard for everyone to believe, but vendors lie and those whiz bang network tools can screw you over.
We have several thousand users on our campus with several thousand computers, running on about a half-dozen Cisco 6500-series switches. Spanning tree recalculations take about a second or two. That's no big deal, and traffic is re-routed nicely when something goes wrong. But if an interface that uplinks to the other switches is flapping up and down, the whole network will grind to a halt with spanning tree.
Test Network GOOD (if you have the money).
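The flapping-uplink failure mode described above is why vendors ship hold-down or "flap dampening" features: a link that changes state too often gets suppressed instead of triggering recalculation after recalculation. A toy version in Python; the window and limit values are made up for illustration:

```python
# Suppress a link that has bounced up/down too many times inside a
# sliding time window, rather than letting every transition kick off
# another spanning-tree recalculation.

def should_suppress(transitions, now, window=60.0, limit=5):
    """transitions: timestamps of recent up/down events for one link."""
    recent = [t for t in transitions if now - t <= window]
    return len(recent) > limit

if __name__ == "__main__":
    flappy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # six flaps in a few seconds
    print(should_suppress(flappy, now=10.0))  # True: dampen the port
```

Real implementations also decay the penalty over time so a link that settles down is eventually readmitted automatically.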
The Solution (Score:5, Insightful)
The solution is to put edge routers in every building (Cisco 6509s with MSFC cards). Segment each building into a different IP network and route between the networks. That way you may lose a building if spanning tree goes futzed, but you won't lose the whole campus.
Sure you'll be a touch slower routing between the segments but you'll have much more reliability.
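The design above in concrete terms: carve one campus block into per-building subnets, so a spanning-tree meltdown stays inside a single building's layer-2 domain and the routers contain the blast radius. A sketch using Python's `ipaddress` module; the addresses and building names are invented, not the hospital's:

```python
# Carve a campus /16 into one /24 per building; routers (e.g. the
# MSFCs in each closet) route between them, so broadcast storms and
# STP loops can't cross building boundaries.
import ipaddress

def building_subnets(campus_cidr, buildings, new_prefix=24):
    """Assign one subnet of `new_prefix` length per building, in order."""
    campus = ipaddress.ip_network(campus_cidr)
    return {name: str(subnet)
            for name, subnet in zip(buildings, campus.subnets(new_prefix=new_prefix))}

if __name__ == "__main__":
    plan = building_subnets("10.0.0.0/16", ["East", "West", "Research", "Clinics"])
    for name, subnet in plan.items():
        print(f"{name}: {subnet}")
```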
Add a second network? Not likely to help (Score:5, Insightful)
Of course not. Two solutions are more obvious:
This might also be a good reminder to get very aggressive "liquidated damages" clauses in contracts like this, or to buy insurance. If a patient dies because of the network outage, I am sure that everyone in the supply chain will be named in the lawsuit.
The liquidated damage clause is intended to provide an unambiguous motivation for the technology provider to fix the problem quickly, while the insurance would cover all or a portion of the losses if there is a failure.
I would be extremely surprised if a huge campus like this one did not have a substantial number of different technologies in use, including wireless, and clearly networking them all into the same patient-records database is a challenge.
Redundancy and death (Score:2, Insightful)
A lot of people here have said "build a 2nd network," to which some have basically said, "that's stupid, make your first network run right." I think that if we're talking about the life and death of patients, a second network would be a good idea. It's like the high factors of safety built into things like, say, an elevator: a failure can cause death, so you overbuild it. Remember that you don't have to make everything redundant, just the critical parts of the system. Maybe the administrators can only use the primary network, but the blood-testing labs and nurses' stations and such can use either the primary or the secondary. Cutting off non-critical traffic during an outage also helps keep the whole system more stable.
Life threatening? (Score:3, Insightful)
The network crash in question screwed up the document process, slowed everything down, and made life inconvenient, but I doubt anyone's life was at risk.
Re:Hospital Systems (Score:5, Insightful)
To be fair, they have gotten much better...
You seem to have forgotten to explain why they were worse.
If they are running thick ethernet and VAX machines, it is probably because nobody has looked at the system recently, presumably because it hasn't failed. This is how things should be.
What terrifies me is that places like hospitals (where things really need to keep working) run systems which have only been around for a few years, and in that time proved themselves to be extremely unreliable, in general.
New features should not be added at the cost of stability, and this is what people seem to be doing all the time. People are perfectly capable of carrying on using paper, and should be trained and have a procedure to do so at a moment's notice. If the job is so complex that paper is simply not an option (this seems unlikely; even air traffic controllers can manage without computers), then computers should have a ridiculous amount of redundancy built in to them, something I've only heard of NASA even approaching.
Re:Been there done that, got the ass beating (Score:1, Insightful)
I work at a teaching hospital... (Score:5, Insightful)
Maybe not so ridiculous (Score:2, Insightful)
Perhaps the seemingly ridiculous "secondary" parallel network can be put in place not for redundancy, but as a tool to migrate the existing devices to a properly configured and routed network. If STP brought the whole thing down to begin with, they are probably flat. VLANs and subnetting at the closets, with appropriate L1 redundancy and L3 routing, is most likely the modern network design their IT staff has known for years they should have, but they never had the convincing argument they needed to get management to foot the bill and allow the service disruptions required to make the switch.
Re:No. (Score:5, Insightful)
ostiguy
Re:Life threatening? (Score:5, Insightful)
Re:Spanning tree (Score:4, Insightful)
Contribution to causality responsibility (Score:5, Insightful)
Would it be fair to say that the bridge collapsed because a 300 lb man was on it? It is completely clear that he contributed to the collapse of the bridge, in the sense that he contributed to the stresses on the structure. One might even say he is more responsible than a 100 lb woman who was also on the structure at the time.
But, we'd generally expect that a footbridge be engineered to support a 300lb man. Or if not, to isolate the failure (e.g. the planks under him might fall out, but the bridge as a whole should not collapse). It's part of the designer's job to anticipate this.
I've done a lot of troubleshooting in my time, of networks and other systems. One thing I've learned is that in the case of failure you just can't fasten on one thing that is out of the ordinary. At any given time, in a big enough system, something's bound to be out of the ordinary. Even if you can trace, step by step, the propagation of a problem from a single anomalous event, it is the capacity of the system to propagate the problem that is the real issue, at least if you take a conservative, defensive stance in design.
Problem was with bad Business Practices... (Score:2, Insightful)
I develop business practices for large industries (including, in the past, the Trans-Alaska pipeline, et al.). These industries rely heavily on computers, and each has developed plans and trained their critical personnel for emergencies like power failures, computer failures, etc. Reliance on a single tool to protect safety & environment is bad, m'kay?
no, identical networks crash in identical ways (Score:2, Insightful)
As for "identical separate network", at my old company, we had a pair of Cisco PIX units that were configured in stateful failover; this means they share enough information that if one keels over, not a single connection is dropped.
Unfortunately, the PIX OS release had a bug that would cause a crash every so often, and guess what?
One would crash, then the second would crash immediately.
As mentioned, the issue here was completely improper network structure, with research and production networks one and the same. Does this mean someone can walk in with a laptop and start spewing data and/or false routing info and crash the entire hospital? The responsible parties should be FIRED, given today's labor market; absolutely inexcusable.
I'd also guess improper change control procedures were involved here as well.
Whoever handles the hospital's emergency preparedness should also be fired for not keeping staff familiar with alternative methods (gasp, PAPER!). What if they had a power failure? That happens all the time, and not always because of external causes... "keeping the power on" is not as simple as "install a big backup power plant for the place." As Exodus discovered once at their CA datacenter, backup generators don't always work.
Downtime Procedures (Score:5, Insightful)
I work at a hospital, on the networking side of things. It's a fairly large hospital, and we've got some pretty amazing tech here that runs this place. But BY LAW we have downtime procedures. ALL STAFF MUST KNOW THEM. We have practice sessions monthly in which staff uses downtime procedures (pen and paper) to ensure that if our network were completely lost, we could still help patients. It's the friggin law. Whoever fucked up and hadn't looked at downtime procedures in 6 years should be fired. That's just bullshit.
I don't know how that hospital was able to pass inspections.
Why not fix spanning tree? (Score:3, Insightful)
Multiple Problems and Multiple Solutions (Score:2, Insightful)
Full redundancy is impossible - are you really going to have dual NICs in every workstation and expect that everything would just work in the event of a failover? First of all, the expense would be incredible, and the maintenance would be a nightmare. If they are like most institutions, they are already understaffed and overworked - they wouldn't be able to keep something like that together. Dual-home closet switches to redundant routers/switches that are in turn dual-homed to a redundant core. Servers should have multiple NICs that are attached to multiple switches specifically to provide redundancy.
The worst problem here, though, was not the network itself. This is probably the most prevalent common problem to all institutions - they had no test environment. As multiple other posters have pointed out, this experimental database should never have been attached to a production network, regardless of the expected impact it might have. The key word about it is EXPERIMENTAL - you don't know how it might impact anything. As long as there is no separate environment for testing, there's really no such thing as redundancy no matter how the network is configured.
Say, for example, that the application took down the primary network, so the secondary comes up and takes over. Did anyone realize what caused the failover? Probably not, since a properly configured network will failover in a matter of seconds. So, the application is still running. How long until the secondary network fails as well? Then all of the expense and reconfiguration that went into building the redundant network were for nothing.
If this hospital is like most, they have an extremely diverse hodgepodge of equipment - some incredibly old stuff that they keep around because it works, and some really cool cutting-edge gadgets that everyone can see the benefit of. They've also expanded the network as needed and tried not to take anything down when they did it, so what they've ended up with is a logical rat's nest. VLANs probably have been created, but they're probably trunked everywhere, because the goal of the expansion was to connect more devices, not to segregate by function. Hospitals don't get down time, so it's not a simple thing to say that things have to be reconfigured. Odds are the workstations may not all even be on DHCP, so changing an IP may require a person (back to that understaffed thing again) touching possibly hundreds of workstations. Yes, that needs to be done, and I don't know a single network admin who wouldn't agree, but when you have to have outages cleared by upper management, who are going to be chewed out by the board if the time frame runs longer than you expected, it turns into a lot more than just what is actually best for the network.
The solution: use down time wisely. Stage implementations and keep them within the allotted time frames. And DOCUMENT. I know - nobody likes to do the documentation, but I think we can all agree that it's a lot easier to plan migrations if you have documentation of what is currently there. Realize that no matter what you do, it's not going to last forever. Your cable plant probably has a lifespan of 10 years (not to say that you may not get 20 or even 30 years out of it, as long as you're willing to stay slow), but your network devices will probably only be there for 5 years. Are you still going to be there for the next change? Probably not, so be nice to the company and to the people who follow after you, and document.
Just my $0.02, and I'm just that blond chick, so what do I know anyway...
Interesting response (Score:3, Insightful)
They have been open about the problem, in a way that a for profit corporation could never be. This allows the rest of the world to learn from the experience.
A common logical fallacy... (Score:3, Insightful)
It's all about the Benjamins (Score:5, Insightful)
The computer systems at my wife's medical school were apparently run by a herd of poorly trained monkeys. Systems would crash constantly, admin policies were absurd, and very little was done to fix anything. At her current hospital, the residents in her department are stuck with machines that literally crash 10+ times daily. Nothing is done to fix them because that would take expertise, time and $, all of which are either in short supply or withheld.
Hospitals really need serious IT help and it is a very serious problem. This article just illustrates how pathetically bad they do the job right now. I wish I could say I was surprised by this but I'm not.
Union "help" (Score:3, Insightful)
My advice is to get to know any tradespeople you may have to deal with on a regular basis for things like electrical work, moving furniture, etc. It's amazing how far just treating them as fellow skilled professionals will get you. Resorting to bribery (aka "gifts") can also help. If you give the union electrician a bottle of nice scotch or a box of cigars when he adds some new circuits in the server room, he is much more likely to come out at 3am on a Sunday morning when you need him NOW.