Hospital Brought Down by Networking Glitch
hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled to move information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long-term solution apparently being proposed is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
Well! Woopsy! (Score:1, Interesting)
No. (Score:5, Interesting)
No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. Anything less is reckless and irresponsible behavior.
Spanning tree (Score:2, Interesting)
I think the answer is to disable spanning tree.
We had a similar problem here (large academic installation, hundreds of workstations, several sites) with things (before my time, I hasten to add) being one Big Flat Network (shudder) using IPX primarily and Novell. Needless to say, this was not good. I've since redesigned things using IP and multiple VLANs, however there is still the odd legacy system that needs access to the old net.
My solution was to tap the protocols running in the flat network and put them into VLANs that can be safely propagated around the layer 3 switched network and presented wherever we wish. The entire "flat" network is tapped into a VLAN, and the IP services running on it are routed into the layer 3 network. If there are serious problems with either network, we just pull the routes linking the two together.
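For readers following along, the "tap into a VLAN" approach boils down to 802.1Q tagging. As a minimal sketch (illustrative only, not the poster's actual configuration), here is where the 12-bit VLAN ID sits in a tagged Ethernet frame:

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def vlan_id(frame: bytes):
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if untagged."""
    if len(frame) < 16:          # 6B dst MAC + 6B src MAC + 4B tag minimum
        return None
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        return None              # no 802.1Q tag follows the MAC addresses
    return tci & 0x0FFF          # VLAN ID is the low 12 bits of the TCI


# A frame tagged with VLAN 42 (priority bits zero):
frame = b"\xff" * 6 + b"\xaa" * 6 + struct.pack("!HH", TPID_8021Q, 42)
```

Pulling the routes that link the two networks then just means removing the layer 3 interface for that VLAN; the tag itself keeps the legacy traffic contained at layer 2.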
Disaster recovery (Score:4, Interesting)
No. They did everything right. Falling back to paper and runners is the best they could do to safeguard patients' lives. An 'identical' network would be susceptible to the same failure modes as the primary.
That said, hopefully it wasn't really six years since they had run a disaster exercise where they pretended that the computers were unavailable...
Re:Problem was with an application, (Score:5, Interesting)
Yes, this person should have been using an ad hoc database (assuming one is set up), however access to various things like this tends to get tied up due to "odd" management practices.
Realistically, a backup network sounds good; however, there are other ways around this. It could have been prevented with correct administration of the network itself; for instance, Sybase systems have procedures set up to handle bottlenecks like this. (Of course, I could be talking out of my a$$, as I'm one of those people without real access anyway... far from root... more like a leaf.)
Re:Well! Woopsy! (Score:4, Interesting)
However, a network like this could be life-critical, and there probably should be contingencies for a variety of circumstances, including deliberate subversion.
Well the thing is... (Score:1, Interesting)
Rebuilding the system from the ground up poses several major hurdles. The first is the systematic migration of data while the original database is still running; for a hospital, this database is clearly mission-critical!
The other problem is mimicking the interface and relationships within the database, so as to reduce retraining. Retraining is a major problem when switching systems. All in all, redoing the database is a major undertaking, and probably not viable in either time or money for the hospital.
Sadly, I have to contend that duplication of their system is the best short-to-medium-term solution.
Re:That's why I hate automatic routing (Score:5, Interesting)
This whole situation arises from poor training and poor design. Having several friends who work in hospitals, I know that they typically don't offer a lot of money for IT/network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.
Done ranting now, can you tell I was laid off a while ago and now stuck in a contract with a network designed by a bunch of inexperienced people?
Swannie
Re:Spanning tree (Score:5, Interesting)
A) A campus could afford to do Layer 3 at every closet switch
or
B) Live without Layer 2 redundancy back to the Layer 3 core.
I'm sure in a healthcare environment, neither is an option. The first is too expensive (unless you buy cheap, and hence unreliable equipment) and the second is too risky.
Spanning tree didn't cause the problem here. Mismanagement of spanning tree sounds like it caused the problem.
Spanning tree is our friend, when used properly.
Complexity brings bugs (Score:5, Interesting)
We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.
But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.
Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.
Re:Why fly equipment from california?? (Score:2, Interesting)
It would be redundant to have one on each coast, because they were able to get our stuff to us the same day in rural Mississippi.
Cisco implementation of Spanning Tree sucks (Score:4, Interesting)
I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all, from my asking around in the past. It may not be possible to turn around and re-engineer it with the existing equipment, but I think I would much rather throw out the vendor and re-engineer the entire thing correctly before putting in a second shabby network.
I speak from having assisted on something like this in a very small campus environment (1,500 nodes maybe). We basically tore out a disgusting mess of a LAN and implemented a fully switched, beautifully laid out network with redundant links to all closets, an 8 Gb trunk between two buildings, etc., in the breadth of one weekend. Obviously there was tons of planning involved, cabling run in preparation, and so on, but what a fantastic move it was.
Sure there were hiccups Monday morning, but everything was perfectly fine by the end of the week.
Two wrongs don't make a right.
done right in the first place (Score:3, Interesting)
In the Real World, where you can't shut everything down at upgrade time, a PDP-11 connected to terminals was put in 25 years ago. The PDP-11 was replaced with a VAX, which ran in parallel with the PDP-11 while it was brought online. A few years later a couple of PCs (running DOS 3.0) were hooked up to each other via a Novell network, which was connected to the VAX. Ten years ago the VAX was replaced with a few servers, which ran in parallel with the VAX until they were trusted. Along the way various hubs, switches, and routers were installed. And upgraded as the need arose. The cables were upgraded, also as the need arose, and not all at once.
Network Utilization Analysis not run yet (Score:2, Interesting)
They were probably running at around 30-35% capacity and most networks get REAL funny at around that point. The following comment is rather telling: "The large volume of data the researcher was uploading happened to be the last drop that made the network overflow."
Another telling comment about the situation was: "network function was fading in and out".
Re:Well! Woopsy! (Score:2, Interesting)
YES- air traffic management experience... (Score:5, Interesting)
Let's talk about real-time systems. No, not "Voice over IP" or "streaming video" crap; I mean REAL, human-grade real-time systems.
How do they get 99.99999% reliability? The components they use may be good, but nothing is that good! They get it by 1) removing single points of failure and 2) rigorously analyzing common mode failures (a sequence of failures that brings everything down).
How is this done? You put things in parallel. Machines are multi-homed. Critical applications are hot-standby, as are their critical servers. You have the nightmare of constant standby data management (the primary sending a copy of its every transaction to the secondary and to the tertiary), but when the power on one side goes out (of course your primary and standby are in different buildings connected to different power supplies, right?!), the secondary steps right up.
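As a toy sketch of that hot-standby pattern (all names hypothetical; nothing like a real air traffic system, which would also need acknowledged replication and failure detection): the primary replicates every transaction to all live standbys, and the next live node steps up when the primary dies.

```python
class Node:
    """One member of a redundant set: keeps an ordered log of transactions."""
    def __init__(self, name):
        self.name = name
        self.log = []
        self.alive = True


class Cluster:
    """Primary/secondary/tertiary set. Every commit is replicated to all
    live members (standby data management), so any survivor can step up."""
    def __init__(self, *nodes):
        self.nodes = list(nodes)

    @property
    def primary(self):
        # The first live node acts as primary; standbys step up in order.
        return next(n for n in self.nodes if n.alive)

    def commit(self, txn):
        for n in self.nodes:
            if n.alive:
                n.log.append(txn)   # replicate the transaction everywhere


cluster = Cluster(Node("primary"), Node("secondary"), Node("tertiary"))
cluster.commit("admit patient 1")
cluster.commit("prescribe drug X")
cluster.nodes[0].alive = False      # power fails in the primary's building
new_primary = cluster.primary       # the secondary steps right up
```

A real system would have the standbys acknowledge each transaction before the primary commits; this sketch skips that to show only the shape of the idea.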
Networks are fragile. (Score:3, Interesting)
Networks are fragile; I'm surprised there aren't more massive outages.
The answer might be to hire competent network staff, and perhaps train some other IT employees with basic knowledge to help in emergencies. A second network seems a little extreme, both cost- and management-wise.
KISS: Keep it simple, stupid!
Was it OSPF? (Score:2, Interesting)
Re:Problem was with an application, (Score:5, Interesting)
I have not read Network World enough to form an impression of their style. Is it watered down to favor advertisers and the general IT purchasing people, or is it really a nuts-and-bolts, down-to-earth mag?
Re:CCNP/CCIEs not what they are cracked up to be? (Score:2, Interesting)
Yes, I have some Cisco certs.
Re:Spanning tree (Score:5, Interesting)
Re:Leading question (Score:4, Interesting)
IMHO if you don't like it then stop reading the damn thing. It's just like TV... If you don't like the channel you're watching then turn it, or turn it off and do something else, but don't bitch because you don't like the content.
Re:I don't buy it (Score:2, Interesting)
the bad network design than any app that was run on it. I.e., the app was only able to do as much damage as it did because the network was so bad; if the network had been set up 'properly', then the app could only have done localized damage.
Does that make sense?
Re:Problem was with an application, (Score:2, Interesting)
Re:That's why I hate automatic routing (Score:3, Interesting)
As for why it's good, it can provide layer 2 redundancy at a very small cost (basically the cost of an additional cable). While the same can be provided with a routed network at layer 3, the cost is much higher, and a properly configured spanning tree based network will fail over very quickly and provide lots of trouble-free operation.
Beyond that, spanning tree can often protect people from themselves. What happens when that intern plugs a cable in the wrong place and creates a bridging loop? You guessed it: no spanning tree, no protection from bridging loops, and you can kiss all, or part (depending on the design), of your network goodbye. Oh, and good luck finding that cable, especially if it's a big place; don't think that intern is going to admit his error and get fired...
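To make the loop-protection point concrete, here is a toy model (not how real 802.1D bridges negotiate, which involves BPDUs, bridge priorities, and port costs): build a tree outward from the root bridge and put every redundant link in blocking state, so the intern's accidental extra cable becomes a blocked port instead of a bridging loop.

```python
from collections import deque


def spanning_tree(links, root):
    """Return (active, blocked) link sets: a BFS tree from the root bridge
    keeps exactly one loop-free path to every switch; all other links are
    put into blocking state."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    active, seen, queue = set(), {root}, deque([root])
    while queue:
        switch = queue.popleft()
        for nbr in sorted(adj.get(switch, ())):
            if nbr not in seen:
                seen.add(nbr)
                active.add(frozenset((switch, nbr)))
                queue.append(nbr)
    blocked = {frozenset(link) for link in links} - active
    return active, blocked


# Three switches cabled in a triangle: a physical loop, but STP blocks one leg.
active, blocked = spanning_tree([("A", "B"), ("B", "C"), ("C", "A")], root="A")
```

With the blocked link held in reserve, losing either active link lets the blocked one take over, which is the cheap layer 2 redundancy described above.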
Swannie
Re:No. (Score:5, Interesting)
If they have that level of redundancy for the electrics then I see no reason why they shouldn't for the network.
Re:Simple Answer (Score:4, Interesting)
Medical records are probably the most sensitive records there are, and therefore it's essential that any access to them is both authenticated and audited. The first ensures that only authorized people can access them. The second ensures that misuse of the records can be detected, e.g. if someone who has authorization to access records decides to look up their neighbours without good reason.
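A toy sketch of that authenticate-then-audit idea (all names hypothetical, not any real medical-records API): every access attempt is checked against the record's authorization list and logged whether or not it succeeds, so a later misuse review has a complete trail.

```python
audit_log = []  # (user, record_id, allowed) tuples, in access order


def access_record(user, record_id, acl):
    """Allow access only if `user` is on the record's ACL, and log the
    attempt either way so misuse can be detected after the fact."""
    allowed = user in acl.get(record_id, set())
    audit_log.append((user, record_id, allowed))
    return allowed


acl = {"record-17": {"dr_smith", "nurse_jones"}}
access_record("dr_smith", "record-17", acl)          # legitimate access
access_record("curious_neighbor", "record-17", acl)  # denied, but still logged
```

Note that even the authorized access is logged: the audit trail is what lets you catch the insider who is allowed in but has no good reason to look.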
Re:Hospital Systems (Score:2, Interesting)
That wasn't a manned flight :-)
I've heard stories about NASA having completely different teams of programmers in different cities being given the same specs, and of multiple computers running different programs independently controlling separate hydraulics, to the point where if one decides to move something one way, the others can physically force it correct. Now that's redundancy.
I'll bet that people designing new computerized air traffic control systems have never even heard of a real-time system, never mind know what one is.
Fraternal Twins (Score:5, Interesting)
The ideal would be to actually use both networks, such as by using each on alternating weeks. This ensures that both networks can handle full normal operations and are both operational.
Re:No. (Score:3, Interesting)
While in the short term the answer is to fix what is broken, they should have had an alternative network set up long ago. When you are dealing with something as important as a hospital, you should have redundancy for everything, and that means true redundancy. There should be two T1 lines coming in from two different vendors, from opposite directions, if a failure would endanger lives. If something is truly mission-critical, it should be redundant. If it is life-threateningly critical, every single piece should be redundant.
Mission Critical Networks 101 (Score:5, Interesting)
We have a long way to go before data networks reach the stability of, for example, the public telephone system. The modern reality is that these networks are susceptible to a host of trivial yet potentially catastrophic failure scenarios. Spanning Tree (STP) is a very unreliable protocol. It has the potential to fail under many conditions, such as the presence of physical errors, very high load, or a bug in the OS of one or many network devices.
Broadcast storms will occur. ARP storms will occur. OS bugs will crop up. Facilities personnel will play jump rope with your cable plant.
These problems can be mitigated, but not eliminated, by good network design. Thus, in environments such as hospitals and banks, where the cost of network downtime is too great to bear, it is common practice to build one or several parallel infrastructures as I have described.
FUNNY NETWORK TRICKS
I used to be in charge of the NOC at a large investment bank in New York. One of our buildings had six floors each housing 1,000 equities traders -- and this was during the stock market boom. Network downtime was not tolerated during trading hours. Therefore, the building was divided into four separate network domains connected to each other, server farms, and the WAN/MAN environment via a layer-3 core.
-- One time a printer became wedged and proceeded to send out ARP requests at the rate of thousands per second. The flood of messages pegged the CPUs of the routers servicing that domain and brought network services to a halt. Time To Resolution: 20 minutes (proud to say) to deploy sniffer, identify offending host, and rip its cable out of the wall with extreme prejudice. % of building affected: 25.
-- Over the course of several months, the Novell/NT team progressively decommissioned Novell servers and replaced them with W2K servers. Unfortunately, nobody thought to turn off the Netware services in the roughly 1,000 printers deployed throughout the building. On one glorious day, the very last Netware server was decommissioned in a particular domain leaving the printers in that domain with no server to "attach" to. The resultant flood of SAP messages became so great that the Cisco routers could not service them in a timely manner and they became cached in memory. The routers would gradually run out of memory, spontaneously reboot, and repeat the cycle. Time To Resolution: ONE FULL DAY. % of building affected: 25. Number of hours spent in postmortem meetings: ~15.
-- On several occasions, Spanning Tree failed resulting in loss of network services for the affected domain. Time To Resolution: 15 minutes to identify problem and perform coordinated power cycle of Distribution switches. % of building affected: 25.
And the list of stories goes on. You get the point.
Counterexamples (Score:3, Interesting)
Here are some examples of the ways in which failures can occur that have implied linkages:
(1) Both trains are damaged by an earthquake.
(2) New instructions for routine maintenance were printed incorrectly (e.g. causing the mechanics to under-torque a critical bolt).
(3) The firm has cut the maintenance budget and is neglecting routine maintenance.
(4) The train is sabotaged by disgruntled employees or terrorists.
(5) Fuel filters delivered by manufacturer are faulty or incorrectly manufactured.
(6) Design flaw means trains do not meet expected performance specifications.
In reality, failures tend to be linked rather than independent. You can't use simple multiplicative logic; you have to use Bayesian logic. P(B|A) ≠ P(B): the probability of B given A is different from the probability of B in the absence of any other information. The FAA and military know this. If an aircraft crashes, then all aircraft of the same type are typically grounded for a period while the problem is analyzed to eliminate some kind of systematic flaw.
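A numeric sketch of that point (the probabilities are invented for illustration): multiplying marginal probabilities assumes independence, while a shared cause, such as common power or a common OS bug, makes the joint failure far more likely.

```python
def p_both_independent(p_a, p_b):
    """P(A and B) under the naive independence assumption."""
    return p_a * p_b


def p_both_correlated(p_a, p_b_given_a):
    """P(A and B) = P(A) * P(B|A); a shared cause makes P(B|A) >> P(B)."""
    return p_a * p_b_given_a


# Suppose each network alone fails on a given day with probability 0.01.
naive = p_both_independent(0.01, 0.01)   # looks wonderful on paper
# But if both networks share power, cable trays, or firmware, the second
# failure is likely once the first has happened, say P(B|A) = 0.5.
real = p_both_correlated(0.01, 0.5)      # orders of magnitude worse
```

This is why an 'identical' second network buys far less than the multiplication suggests: the correlated failure modes dominate.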
Re:I don't buy it (Score:2, Interesting)
Makes a lot of sense, actually. I've been doing a bit of a campaign for a while to have a separate domain, or the ability to connect my test machines (in complete isolation, of course) to only each other and maintain my OWN PDC. Of course no one thinks this is a good idea, but some of the tests I need to run can bog down when the network's busy, and they of course are not helping the rest of the network be happy.
Our network's reasonable, but people should give software folks what they need, not force them to work under the constraints the sales folks do (for example).
Sure, we have to respect the 'rules' when joining the normal network for email and such, but testing of network applications should almost be on a smaller completely isolated network (to prevent dragging down the whole system when an automated test goes awry).
Infinite loops don't just happen to stupid people
I know a developer who had to leave one job because the IT folks didn't understand why he couldn't develop windows services without admin equivalence on his local machine (duh).
Re:CLARIFICATION (Score:2, Interesting)
My main reason for posting was to appease my instinctual reaction to the (somewhat intuitive) mistake sometimes made of assuming that having twice the stuff makes it twice as good/reliable, etc. That holds true for availability (10-fold in fact), but you'll get less in the case of reliability, and manageability is also a concern, since you'll have to constantly check the backup network (if it's not in active use, failures are harder to find, or predict for that matter). Also, failures aren't always randomly dispersed throughout the network, as the model might imply. You have to figure out how much failure each part of the network can sustain.
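To put rough numbers on the availability side of that argument (a sketch under the usual, and as noted above often unrealistic, independence assumption): duplicating a component multiplies unavailabilities, so the system is down only when every copy is down at once.

```python
def parallel_availability(a, n=2):
    """Availability of n independent parallel units, each with availability a:
    the system is unavailable only when all n units are down simultaneously."""
    unavailability = (1.0 - a) ** n
    return 1.0 - unavailability


single = 0.99                          # one network: down 1% of the time
dual = parallel_availability(single)   # two independent copies: ~99.99%
```

Two 99%-available networks give roughly 99.99% availability, a 100-fold cut in downtime, but only if the failures really are uncorrelated, which is exactly the assumption the poster is warning about.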
So, throwing more hardware, developers, or whatever at the problem isn't a real solution. Figuring out what was wrong in the first place will let them spend their money more wisely, rather than letting all that hardware go to waste, doing nothing. They could possibly get all the redundancy they want with less than twice the hardware and maybe even increase performance of the network during regular usage.
OK, I've totally overspent my $0.02.
Re:Fix it Scotty ! (Score:1, Interesting)
Please RTFM *BEFORE* buying all the gear, when you are in the design phase. If you bought the wrong gear, or if you didn't buy enough memory or whatever, there's not a whole lot I can do to save you.
Please RTFM *BEFORE* connecting it up. We (yes the people who you are on the phone with) write sample configs up for a reason. We set them up in the lab and verify that they work BEFORE they are submitted for the website.
A Case History (Score:3, Interesting)
Unfortunately more 'radical' minds prevailed and the project was eventually abandoned after $100M.
In my opinion... (Score:2, Interesting)
The concept is that of "a network of networks", much like Cisco's DCN solution for telco operators. This is a series of interconnected networks that are capable of standing alone in an emergency. These networks are normally oriented around particular application/traffic/usage patterns. An example would be a research network for research workstations, a lab network, a cardiac care network, and so on.
All of these networks could exist as separate layer 2 VLANs trunked back to the facility data center, if bandwidth is available. Within the data center, layer 3 routing could handle traffic that needs to cross between these networks. The data center would also have separate networks for each application group so that applications generally aren't able to interfere with each other.
Obviously this is an overly broad synopsis and leaves out many details; it is also just as obvious that I'm talking about a campus environment here and not a WAN, where the same theory will work, but with different implementation.
Re:Um.. (Score:5, Interesting)
Example: They want to replace Netware fileservers (they've got something around four years' uptime, and that's including having their RAIDs expanded; all that's going to stop them is a man with a sledgehammer) with Windows ones. While Windows servers, if configured correctly, are really stable, they are not stable enough for truly mission-critical jobs (in this case, dealing with insurance companies and medical evacuation, where time is not just money, it's life), yet the idiots in charge have been suckered by Microsoft's marketing.
In this case, staying with Netware has saved lives.
Accountants have too much control. They do not understand that if something is vital, you do NOT give it anything less than the very best money can buy. So it'll cut into your profit margins. So what? At least you will still have the margins.
Re:What is spanning tree protocol? (google whoring (Score:2, Interesting)
Some workstations turn up their ethernet link by software, and then try to use the port right away to, for instance, obtain a DHCP lease.
Spanning tree starts doing its work as soon as it sees ethernet link. So, there's a delay between the time the link comes up and when traffic starts to pass.
Apple's DHCP implementation was bitten by this [apple.com] on some of their machines, affecting the startup of the AppleTalk stack, which, unlike DHCP, will not retry its initial auto-configuration and address discovery.
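A toy timing model of that interaction (the 30-second figure is classic 802.1D listening plus learning; the retry gaps roughly mimic DHCP's exponential backoff; none of this is either stack's real implementation): frames sent before the port reaches forwarding state are silently dropped, so only a protocol that keeps retrying survives the delay.

```python
def first_success_time(forwarding_delay, retry_gaps):
    """Time of the first frame that gets through a switch port that only
    starts forwarding `forwarding_delay` seconds after link-up, given the
    gaps between retry attempts; None if the sender never retries late
    enough."""
    t = 0.0
    for gap in retry_gaps:
        if t >= forwarding_delay:   # port is forwarding: frame gets out
            return t
        t += gap                    # frame silently dropped; back off
    # one final attempt after the last backoff interval
    return t if t >= forwarding_delay else None


STP_DELAY = 30.0  # classic 802.1D: ~30 s of listening + learning

dhcp = first_success_time(STP_DELAY, (4, 8, 16, 32))  # DHCP-style backoff
once = first_success_time(STP_DELAY, ())              # single fire-and-forget
```

The retrying client eventually gets through (late, which is why users see a long DHCP delay), while the fire-and-forget probe, like the AppleTalk auto-configuration described above, simply fails. PortFast-style features fix this by skipping the delay on edge ports.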
I've always been skeptical of "intelligence" added to layers below 3. There are always unforeseen interactions and consequences to ANY variance from a set standard.
- Peter
Re:Spanning tree (Score:2, Interesting)
There's an even easier way to fix the problem in your example... don't give the idiots access to multiple ports in the same network.
And I would submit it's not very wise to create a city-sized switched ethernet network.
Re:Hospital Systems (Score:2, Interesting)
CS programs are supposed to teach both the theory AND the operations of current technology. This should allow CS grads to quickly learn new technology incrementally. That's the point of these programs.
People coming out of tech schools are fine, but they often have no idea how things REALLY work (just "if "a" happens then I'm supposed to do "b" type of knowledge).
OK...I'm pretty bored with the thread now.