Hospital Brought Down by Networking Glitch

hey! writes "The Boston Globe reports that Beth Israel Deaconess hospital suffered a major network outage due to a problem with spanning tree protocol. Staff had to scramble to find old paper forms that hadn't been used in six years so they could transfer vital patient records and prescriptions. Senior executives were reduced to errand runners as the hospital struggled to move information around the campus. People who have never visited Boston's Medical Area might not appreciate the magnitude of this disaster: these teaching hospitals are huge, with campuses and staff comparable to a small college, and many, many computers. The outage lasted for days, despite Cisco engineers from around the region rushing to the hospital's aid. Although the article is short on details, the long-term solution apparently proposed is to build a complete parallel network. Slashdot network engineers (armchair and professional): do you think the answer to having a massive and unreliable network is to build a second identical network?"
  • Well! Woopsy! (Score:1, Interesting)

    by uberred ( 584819 ) on Wednesday November 27, 2002 @10:42AM (#4766983)
    This is almost too good... could someone have hacked into their network and deliberately taken it down?
  • No. (Score:5, Interesting)

    by Clue4All ( 580842 ) on Wednesday November 27, 2002 @10:45AM (#4767009) Homepage
    do you think the answer to having a massive and unreliable network is to build a second identical network?

    No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.
  • Spanning tree (Score:2, Interesting)

    by skinfitz ( 564041 ) on Wednesday November 27, 2002 @10:49AM (#4767030) Journal
    do you think the answer to having a massive and unreliable network is to build a second identical network?

    I think the answer is to disable spanning tree.

    We had a similar problem here (large academic installation, hundreds of workstations, several sites) with things (before my time, I hasten to add) being one Big Flat Network (shudder) using primarily IPX and Novell. Needless to say, this was not good. I've since redesigned things using IP and multiple VLANs; however, there is still the odd legacy system that needs access to the old net.

    My solution was to tap the protocols running on the flat network and put them into VLANs that can be safely propagated around the layer 3 switched network and presented wherever we wish. The entire "flat" network is tapped into a VLAN, and the IP services running on it are routed in. If there are problems with either network, we just pull the routes linking the two together, if it were ever to get that bad.
  • Disaster recovery (Score:4, Interesting)

    by laughing_badger ( 628416 ) on Wednesday November 27, 2002 @10:50AM (#4767041) Homepage
    do you think the answer to having a massive and unreliable network is to build a second identical network?

    No. They did everything right. Falling back to paper and runners is the best they could do to safeguard patients' lives. An 'identical' network would be susceptible to the same failure modes as the primary.

    That said, hopefully it wasn't really six years since they had run a disaster exercise where they pretended that the computers were unavailable...

  • by sugrshack ( 519761 ) on Wednesday November 27, 2002 @10:52AM (#4767067) Homepage
    that's a good initial assumption, however my experience with similar issues tells me that you can't pin all of this on one person.

    Yes, this person should have been using an ad hoc database (assuming one is set up), however access to various things like this tends to get tied up due to "odd" management practices.

    Realistically, a backup network sounds good, however there are other ways around this... it could have been prevented with correct administration of the network itself; for instance, in Sybase systems, there are procedures set up to handle bottlenecks like this. (Of course, I could be talking out of my a$$, as I'm one of those people without real access anyway... far from root... more like a leaf.)

  • Re:Well! Woopsy! (Score:4, Interesting)

    by hey! ( 33014 ) on Wednesday November 27, 2002 @10:52AM (#4767070) Homepage Journal
    I don't think that deliberate malicious action is a very likely cause. The article wasn't for technical folk, so it's anyone's guess; mine is that the network grew gradually to the point where it couldn't be restarted. You can always add a few nodes to a large network, but it isn't necessarily possible to start such a network from a dead stop. Probably a handful of well placed routers would have prevented this.

    However, a network like this could be life-critical, and there probably should be contingencies for a variety of circumstances, including deliberate subversion.
  • Well the thing is... (Score:1, Interesting)

    by Anonymous Coward on Wednesday November 27, 2002 @10:54AM (#4767097)
    Having worked on several database systems, I can say that improper planning and maintenance are the main causes of large, unwieldy and ultimately unstable systems. In large organizations where IT is not a major business area, i.e. a hospital system, the existing database system has probably been augmented several times to increase functionality (and capacity) - probably by different parties as well. This multiple-patching approach results in instability as the database grows far beyond its original intended purpose. However, due to the vast stores of data, and the repeated tinkering with it by various parties, migration is a nightmare.

    Rebuilding the system from the ground up poses several major hurdles. The first is the systematic migration of data while the original database is still running; for hospitals, this database is clearly mission-critical!

    The other problem is mimicking the interfaces and relationships within the database, so as to reduce retraining. Retraining is a major problem when switching systems. All in all, it is a major undertaking to redo the database, and probably not viable, in either time or money, for the hospital.

    Sadly, I have to contend that duplication of their system is the best short-to-medium-term solution.
  • by Swannie ( 221489 ) on Wednesday November 27, 2002 @10:56AM (#4767115) Homepage
    Routing has nothing to do with this; spanning tree is a layer two function, and is responsible for allowing multiple links and redundancy between switches in a network. A properly set-up network running properly set-up spanning tree works wonderfully. Unfortunately, many, many people play with things they don't understand (on a production network, no less).
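
    A rough sketch in Python of what spanning tree accomplishes (this is not the actual 802.1D algorithm, which elects a root bridge by bridge ID and compares path costs; the topology here is hypothetical):

        # Toy illustration of spanning tree's job on a switched LAN: keep the
        # redundant cables plugged in, but block the ones that would form loops.
        from collections import deque

        links = [  # hypothetical campus wiring, with deliberate redundancy
            ("core", "closet-a"), ("core", "closet-b"),
            ("closet-a", "closet-b"),                   # redundant cross-link
            ("closet-a", "lab"), ("closet-b", "lab"),   # redundant lab uplink
        ]

        def spanning_tree(root, links):
            neighbors = {}
            for a, b in links:
                neighbors.setdefault(a, []).append(b)
                neighbors.setdefault(b, []).append(a)
            visited, active = {root}, set()
            queue = deque([root])
            while queue:                    # breadth-first from the "root bridge"
                switch = queue.popleft()
                for peer in neighbors[switch]:
                    if peer not in visited:
                        visited.add(peer)
                        active.add(frozenset((switch, peer)))
                        queue.append(peer)
            return active

        active = spanning_tree("core", links)
        for a, b in links:
            state = "forwarding" if frozenset((a, b)) in active else "BLOCKED"
            print(f"{a:9} <-> {b:9} {state}")

    The blocked links sit idle until a forwarding link dies, at which point the protocol recomputes the tree and unblocks them; that recomputation is exactly where a huge, badly segmented network can get into trouble.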


    This whole situation arises from poor training and poor design. Having several friends who work in hospitals, I know that they typically don't offer a lot of money for IT/network jobs, and this is what happens when underpaid (read: inexperienced) people are allowed to run such a network.


    Done ranting now, can you tell I was laid off a while ago and now stuck in a contract with a network designed by a bunch of inexperienced people? :)


    Swannie

  • Re:Spanning tree (Score:5, Interesting)

    by GLX ( 514482 ) on Wednesday November 27, 2002 @10:57AM (#4767118) Homepage
    This would imply that either:

    A) A campus could afford to do Layer 3 at every closet switch

    or

    B) Live without Layer 2 redundancy back to the Layer 3 core.

    I'm sure in a healthcare environment, neither is an option. The first is too expensive (unless you buy cheap, and hence unreliable equipment) and the second is too risky.

    Spanning tree didn't cause the problem here. Mismanagement of spanning tree sounds like it caused the problem.

    Spanning tree is our friend, when used properly.
  • by stevens ( 84346 ) on Wednesday November 27, 2002 @10:58AM (#4767132) Homepage
    The network at my company is quickly becoming so complex that neither I nor the admins can troubleshoot it.

    We have redundant everything -- firewalls, routers, load balancers, app servers, etc. The idea is to have half of everything offsite, so either the main site or the co-lo can go down, and we still rock.

    But with all the zones and NATs and rules and routing oddities, the network is less reliable than before. It takes days for them to fix routing problems or firewall problems. Every little problem means we need three people troubleshooting it instead of one admin.

    Developers suspect that there's a simpler way to do it all, but since we're not networking experts, it's just a suspicion.
  • by marklyon ( 251926 ) on Wednesday November 27, 2002 @10:58AM (#4767134) Homepage
    They have a huge hot lab in California where they have pre-configured switches, routers, etc. running and ready to go at a moment's notice. When my ISP went down, they sent (same day) three new racks of modems configured with our last known "good" configuration, so all we had to do was unplug, pull, connect.

    It would be redundant to have one on each coast, because they were able to get our stuff to us the same day in rural Mississippi.
  • by xaoslaad ( 590527 ) on Wednesday November 27, 2002 @11:01AM (#4767158)
    I am not up to speed on spanning tree, but speaking with a coworker after reading this article, it is my understanding that Cisco equipment runs a new instance of spanning tree each time a new VLAN is created. As you can imagine, in such a large campus environment there can be many tens if not hundreds of VLANs. In a short time you turn your network into a spanning tree nightmare. I'd much rather use some nice Extreme Networks (or Foundry or whatever) Layer 3 switching equipment at the core and turn off spanning tree. Use tagged VLANs from the closets to the core and voila, no need for spanning tree... Use Cisco edge devices for WAN links. Building a second rat's nest out of the same equipment seems foolish.
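
    Back-of-the-envelope on why per-VLAN instances add up, assuming (as with Cisco's per-VLAN flavor of spanning tree) one instance per VLAN, each sending a BPDU out every trunk port per 2-second hello interval; the counts are hypothetical, not from the article:

        # Control-plane cost of running one spanning tree instance per VLAN.
        vlans = 200                  # plausible for a large hospital campus
        trunk_ports_per_switch = 8
        hello_interval_s = 2.0       # 802.1D default hello timer

        bpdus_per_second = vlans * trunk_ports_per_switch / hello_interval_s
        print(f"{bpdus_per_second:.0f} BPDUs/sec per switch, steady state")
        # Trivial bandwidth, but each instance is also state the switch CPU
        # must recompute every time the topology flaps -- 200 trees, not one.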

    I'm not even sure how much Layer 3 switching equipment Cisco has; not much at all, from my asking around in the past. It may not be possible to turn around and re-engineer it with the existing equipment, but I would much rather throw out the vendor and re-engineer the entire thing correctly than put in a second shabby network.

    I speak from having assisted on something like this in a very small campus environment (1,500 nodes maybe), where we basically tore out a disgusting mess of a LAN and implemented a fully switched, beautifully laid-out network with redundant links to all closets, an 8 Gb/s trunk between two buildings, etc., in the space of one weekend. Obviously there was tons of planning involved, cabling run in preparation and so on, but what a fantastic move it was.

    Sure there were hiccups Monday morning, but everything was perfectly fine by the end of the week.

    Two wrongs don't make a right.
  • by wiredog ( 43288 ) on Wednesday November 27, 2002 @11:03AM (#4767174) Journal
    You've never worked in the Real World, have you? It is very rare for a network to be put in place, with everything attached in its final location, and then never ever upgraded until the entire thing is replaced.

    In the Real World, where you can't shut everything down at upgrade time, a PDP-11 connected to terminals was put in 25 years ago. The PDP-11 was replaced with a VAX, which ran in parallel with the PDP-11 while it was brought online. A few years later a couple of PCs (running DOS 3.0) were hooked up to each other via a Novell network, which was connected to the VAX. Ten years ago the VAX was replaced with a few servers, which ran in parallel with the VAX until they were trusted. Along the way various hubs, switches, and routers were installed. And upgraded as the need arose. The cables were upgraded, also as the need arose, and not all at once.

  • by chopkins1 ( 321043 ) on Wednesday November 27, 2002 @11:03AM (#4767177)
    In the article, it also states that they had just approved a contractor to do a network analysis: "on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time." If the article summary gives the correct information, I'll bet that large parts of their network were overburdened and hadn't been upgraded in years.

    They were probably running at around 30-35% capacity and most networks get REAL funny at around that point. The following comment is rather telling: "The large volume of data the researcher was uploading happened to be the last drop that made the network overflow."

    Another telling comment about the situation was: "network function was fading in and out".
  • Re:Well! Woopsy! (Score:2, Interesting)

    by Ken Dods' dad's dog' ( 628179 ) on Wednesday November 27, 2002 @11:05AM (#4767197)
    I have seen this happen before in an organisation I have worked for. It happened when a second Cisco network (installed by a large well known company) was joined to an existing one and the routing protocol problems of the new network corrupted the existing one. Solution was to disconnect the two and force the external company to rebuild the new network from scratch.
  • by mekkab ( 133181 ) on Wednesday November 27, 2002 @11:08AM (#4767223) Homepage Journal
    Yes. You do things in parallel and you make things redundant. You are fabricating reliability out of unreliable components, as with TCP over IP.

    Let's talk about real-time systems. No, not "Voice over IP" or "streaming video" crap; I mean REAL human-grade real-time systems.

    How do they get 99.99999% reliability? The components they use may be good, but nothing is that good! They get it by 1) removing single points of failure and 2) rigorously analyzing common mode failures (a sequence of failures that brings everything down).

    How is this done? You put things in parallel. Machines are multi-homed. Critical applications are hot-standby, as are their critical servers. You have the nightmare of constant standby data management (the primary sending a copy of its every transaction to the secondary and to the tertiary), but when the power on one side goes out (of course your primary and standby are in different buildings connected to different power supplies, right?!) the secondary steps right up.
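
    The arithmetic behind those nines, sketched in Python under the big assumption the whole approach depends on: independent failures (no shared power feed, building, or common-mode software bug):

        # Availability of N parallel copies: the system is down only when
        # every copy is down at the same time.
        def parallel_availability(a, copies):
            return 1 - (1 - a) ** copies

        a = 0.999  # one decent component: roughly 8.8 hours of downtime a year
        for n in (1, 2, 3):
            avail = parallel_availability(a, n)
            downtime_min = (1 - avail) * 365 * 24 * 60
            print(f"{n} copies: {avail:.9f} available, ~{downtime_min:.2f} min/yr down")
        # Each copy multiplies in another factor of (1 - a) -- but only if the
        # failures really are independent, hence separate buildings and power.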

  • by XPisthenewNT ( 629743 ) on Wednesday November 27, 2002 @11:14AM (#4767268) Homepage
    I am an intern in a networking department where we use all Cisco equipment. Spanning tree and some other protocols are very scary because once one switch declares itself a server of a given protocol, other switches "fall for it" and believe the new switch over the router. Getting the network back is not as easy as turning off the offender, because the other switches are now set for a different switch server. Power outages are also very scary because if switches use any type of dynamic protocol, they have to come back up in the right order, which Murphy's Law seems to indicate will never happen.
    Networks are fragile; I'm surprised there aren't more massive outages.
    The answer might be to hire competent network staff, and perhaps train some other IT employees with basic knowledge to help in emergencies. A second network seems a little extreme, both cost- and management-wise.

    KISS: Keep it simple, stupid!

  • Was it OSPF? (Score:2, Interesting)

    by Anonymous Coward on Wednesday November 27, 2002 @11:15AM (#4767270)
    The article is a little light on technical details, but does anyone know what internal routing protocol they were using? We've got a network with 11 Cisco routers running OSPF. The routing changes happen very often, because there's a bunch of dial-ups and a few dozen routes that come and go with short-term connections (like backups from a remote office or running a CC authorization from a remote office). Everything works perfectly as long as none of our three newest routers is the first powered up. Those three are running IOS 11.0. After several calls to Cisco (we buy all Cisco internally and for our customer ends, so we get very good support from them) over the past three years, Cisco is still stumped as to what the problem could be. The OSPF section of the config file is only five lines long, so we (and Cisco) are sure there's no problem there. The hospital's problem sounds like it's of the same sort.
  • by nolife ( 233813 ) on Wednesday November 27, 2002 @11:16AM (#4767289) Homepage Journal
    Not only that, but they gave the impression no one had problems using the old paper method. The article actually notes that at times the network was fine, but they decided to stick with the backup method until the issue was resolved, because switching back and forth while the network was flaky was harder. All in all, they made the point that no appointments were missed, no surgeries were cancelled, etc. Meaning business was as usual, but using a backup manual method.

    I have not read Network World enough to form an impression of their style: is it watered down to favor advertisers and the general IT purchasing people, or is it really a nuts-and-bolts, down-to-earth mag?
  • by JohnnyBolla ( 102737 ) on Wednesday November 27, 2002 @11:21AM (#4767335) Homepage
    True. For the most part, having a Cisco cert means you studied hard on how to pass the cert; it really has little bearing on whether or not you can do the work. Not to say that a chimp can pass them, but I have met some people with CCNPs who couldn't troubleshoot a toaster problem.
    Yes, I have some Cisco certs.
  • Re:Spanning tree (Score:5, Interesting)

    by stilwebm ( 129567 ) on Wednesday November 27, 2002 @11:21AM (#4767337)
    I don't think disabling spanning tree would help at all, especially on a network with two campuses and redundant connections between buildings. This is just the type of network spanning tree should help. But it sounds to me like they need to do some better subnetting and trunking, not necessarily using Layer 3 switches. They might consider hiring a network engineer with experience on similar campuses, even large university campuses, to help them redesign the underlying architecture. Spanning tree wasn't the problem; the architecture, and thus the way spanning tree was being used, was the problem.
  • Re:Leading question (Score:4, Interesting)

    by enkidu55 ( 321423 ) on Wednesday November 27, 2002 @11:23AM (#4767351) Homepage Journal
    Isn't that the whole point of posting a story? To foster your own personal agendas? What would be the point in making a contribution to /. if everything was vanilla in format and taste? You would think that the members of the /. community would feel a certain sense of pride knowing that their collective knowledge could help another business/community out with some free advice.

    IMHO if you don't like it then stop reading the damn thing. It's just like TV... If you don't like the channel you're watching then turn it, or turn it off and do something else, but don't bitch because you don't like the content.
  • Re:I don't buy it (Score:2, Interesting)

    by NecroPuppy ( 222648 ) on Wednesday November 27, 2002 @11:25AM (#4767370) Homepage
    I think he's laying more of the fault at the bad network design than any app that was run on it.

    I.e., the app was only able to do as much damage as it did because the network was so bad; if the network had been set up 'properly', then the app could only have done localized damage.

    Does that make sense?
  • by ipstacks ( 629748 ) on Wednesday November 27, 2002 @11:25AM (#4767372)
    Routing is the solution. Anyone who runs a layer two network beyond one switch should be fired. Routing convergence is much faster than spanning tree (even with the Cisco tweaks). Why would I want layer two when layer 3 routers are capable of wire-speed routing?!
  • by Swannie ( 221489 ) on Wednesday November 27, 2002 @11:25AM (#4767373) Homepage
    Can you make a case for why spanning tree is bad? Beyond "it's old" or "I've been burned before"? I've never, personally, heard a good argument as to why spanning tree is bad.


    As for why it's good, it can provide layer two redundancy at a very small cost (basically the cost of an additional cable). While the same can be provided with a routed network, at layer 3, the cost is much higher, and a properly configured spanning tree based network will fail over very quickly and provide lots of trouble-free operation.


    Beyond that, spanning tree can often protect people from themselves. What happens when that intern plugs a cable in the wrong place and creates a bridging loop? You guessed it: no spanning tree, no protection from bridging loops, and you can kiss all, or part (depending on the design), of your network goodbye. Oh, and good luck finding that cable, especially if it's a big place; don't think that intern is going to admit his error and get fired...


    Swannie

  • Re:No. (Score:5, Interesting)

    by pubjames ( 468013 ) on Wednesday November 27, 2002 @11:38AM (#4767492)
    I spoke to an electrician at our local hospital recently. He told me the hospital had three separate electricity systems - one connected to the national grid, one connected to an onsite generator which is running all the time, and a third connected to some kind of highly reliable battery system (sorry can't remember the details) for life support and operating theatres in case both the national grid and the on-site generator fail simultaneously.

    If they have that level of redundancy for the electrics then I see no reason why they shouldn't for the network.
  • Re:Simple Answer (Score:4, Interesting)

    by gorilla ( 36491 ) on Wednesday November 27, 2002 @11:43AM (#4767542)
    Having worked in a hospital, I'll tell you that's not acceptable.

    Medical records are probably the most sensitive records there are, and therefore it's essential that any access to them is both authenticated and audited. The first ensures that only authorized people can access them. The second ensures that in the event of misuse of the records, this can be detected - e.g. if someone who has authorization to access records decides to look up their neighbours without good reason.

  • Re:Hospital Systems (Score:2, Interesting)

    by gorf ( 182301 ) on Wednesday November 27, 2002 @11:48AM (#4767574)

    That wasn't a manned flight :-)

    I've heard stories about NASA having completely different teams of programmers in different cities being given the same specs. Of multiple computers running different programs independently controlling separate hydraulics, to the point where, if one decides to move something the wrong way, the others can physically force it back. Now that's redundancy.

    I'll bet that people designing new computerized air traffic control systems have never even heard of a real-time system, never mind know what one is.

  • Fraternal Twins (Score:5, Interesting)

    by SEWilco ( 27983 ) on Wednesday November 27, 2002 @11:51AM (#4767610) Journal
    I hope the "second redundant network" uses equipment by a different manufacturer and has at least one network technician whose primary duty is that network. That person's secondary duty should be to monitor the primary network and look for problems there. Someone in the primary network staff should have a secondary duty to monitor and check the backup network.

    The ideal would be to actually use both networks, such as by using each on alternating weeks. This ensures that both networks can handle full normal operations and are both operational.

  • Re:No. (Score:3, Interesting)

    by dirk ( 87083 ) <dirk@one.net> on Wednesday November 27, 2002 @12:00PM (#4767683) Homepage
    No, the answer is to fix what is broken. This might be a new concept to some people, but things don't break on their own. If you're doing network upgrades and something stops working, REVERT THE CHANGES AND FIGURE IT OUT. This is reckless and irresponsible behavior.

    While in the short term the answer is to fix what is broken, they should have had an alternative network set up long ago. When you are dealing with something as important as a hospital, you should have redundancy for everything. That means true redundancy: there should be two T1 lines coming in from two different vendors, from opposite directions, if something will endanger lives when it breaks. If something is truly mission-critical, it should be redundant. If it is life-threateningly critical, every single piece should be redundant.
  • by rhoads ( 242513 ) on Wednesday November 27, 2002 @12:11PM (#4767741)
    One of the fundamental concepts in building mission critical networks is what is referred to as "A/B Diversity" -- also sometimes called "salt and peppering". The idea is that you build two or more physically and logically separate network infrastructures and distribute the user population evenly across them. Thus, when a catastrophic failure occurs in one of the network "domains", the other will continue to function and business can continue in "degraded" mode.
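
    A toy model of the payoff, assuming the two domains fail independently and with the same probability (numbers invented for illustration):

        # A/B diversity: split users across two separate infrastructures.
        # Expected per-user downtime is unchanged, but a *total* outage --
        # the scenario that stops the business cold -- becomes far rarer.
        p = 0.01  # hypothetical chance a given domain is down at any moment

        single_total = p                 # one shared network: everyone down
        ab_total     = p ** 2            # both domains down at once
        ab_partial   = 2 * p * (1 - p)   # exactly one domain down (degraded mode)

        print(f"single network, total outage:  {single_total:.4%}")
        print(f"A/B diversity, total outage:   {ab_total:.4%}")
        print(f"A/B diversity, degraded mode:  {ab_partial:.4%}")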

    We have a long way to go before data networks reach the stability of, for example, the public telephone system. The modern reality is that these networks are susceptible to a host of trivial yet potentially catastrophic failure scenarios. Spanning Tree (STP) is a very unreliable protocol. It has the potential to fail under many conditions, such as the presence of physical errors, very high load, or as a consequence of a bug in the OS of one or many network devices.

    Broadcast storms will occur. ARP storms will occur. OS bugs will crop up. Facilities personnel will play jump rope with your cable plant.

    These problems can be mitigated, but not eliminated, by good network design. Thus, in environments such as hospitals and banks, where the cost of network downtime is too great to bear, it is common practice to build one or several parallel infrastructures as I have described.

    FUNNY NETWORK TRICKS

    I used to be in charge of the NOC at a large investment bank in New York. One of our buildings had six floors each housing 1,000 equities traders -- and this was during the stock market boom. Network downtime was not tolerated during trading hours. Therefore, the building was divided into four separate network domains connected to each other, server farms, and the WAN/MAN environment via a layer-3 core.

    -- One time a printer became wedged and proceeded to send out ARP requests at the rate of thousands per second. The flood of messages pegged the CPUs of the routers servicing that domain and brought network services to a halt. Time To Resolution: 20 minutes (proud to say) to deploy sniffer, identify offending host, and rip its cable out of the wall with extreme prejudice. % of building affected: 25.

    -- Over the course of several months, the Novell/NT team progressively decommissioned Novell servers and replaced them with W2K servers. Unfortunately, nobody thought to turn off the Netware services in the roughly 1,000 printers deployed throughout the building. On one glorious day, the very last Netware server was decommissioned in a particular domain leaving the printers in that domain with no server to "attach" to. The resultant flood of SAP messages became so great that the Cisco routers could not service them in a timely manner and they became cached in memory. The routers would gradually run out of memory, spontaneously reboot, and repeat the cycle. Time To Resolution: ONE FULL DAY. % of building affected: 25. Number of hours spent in postmortem meetings: ~15.

    -- On several occasions, Spanning Tree failed resulting in loss of network services for the affected domain. Time To Resolution: 15 minutes to identify problem and perform coordinated power cycle of Distribution switches. % of building affected: 25.

    And the list of stories goes on. You get the point.
  • Counterexamples (Score:3, Interesting)

    by hey! ( 33014 ) on Wednesday November 27, 2002 @12:14PM (#4767757) Homepage Journal
    As pointed out elsewhere, the key assumption is independence - that breakdowns are like rolling dice. You have to consider the causes of the failure. Virtually every realistic scenario you can think of has a dependent aspect which links the possible failures of the trains.

    Here are some examples of the ways in which failures can occur that have implied linkages:

    (1) Both trains are damaged by an earthquake.

    (2) New instructions for routine maintenance were printed incorrectly (e.g. causing the mechanics to under torque a critical bolt).

    (3) The firm has cut the maintenance budget and is neglecting routine maintenance.

    (4) The train is sabotaged by disgruntled employees or terrorists.

    (5) Fuel filters delivered by manufacturer are faulty or incorrectly manufactured.

    (6) Design flaw means trains do not meet expected performance specifications.

    In reality, failures tend to be linked rather than independent. You can't use simple multiplicative logic; you have to use Bayesian logic. P(B|A) ≠ P(B): the probability of B given A is different from the probability of B in the absence of any other information. The FAA and military know this. If an aircraft crashes, then all aircraft of the same type are typically grounded for a period while the problem is analyzed to eliminate some kind of systematic flaw.
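
    To put numbers on it (invented, assuming a shared cause that takes out both trains whenever it is present):

        # Why you can't just multiply: a common cause dominates the math.
        p_indep  = 0.01   # chance a train fails for its own unrelated reasons
        p_common = 0.001  # chance of a shared cause (bad torque spec, sabotage)

        naive_both  = p_indep ** 2                            # independence assumed
        actual_both = p_common + (1 - p_common) * p_indep**2  # common cause included

        print(f"naive  P(both fail): {naive_both:.6f}")    # 0.000100
        print(f"actual P(both fail): {actual_both:.6f}")   # ~0.001100, 11x worse
        # P(B|A) >> P(B): once one train is down, the odds the other is down
        # too have jumped -- exactly why the FAA grounds the whole fleet.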
  • Re:I don't buy it (Score:2, Interesting)

    by patter ( 128866 ) <<pat> <at> <sluggo.org>> on Wednesday November 27, 2002 @12:29PM (#4767919) Homepage Journal
    While it is never said directly, the implication is that the network was in bad shape to begin with, and when this guy started doing whatever he was doing, it just pushed things over the edge.

    Makes a lot of sense, actually. I've been campaigning for a while to have a separate domain, or the ability to connect my test machines (in complete isolation, of course) to only each other and maintain my OWN PDC... of course no one thinks this is a good idea, but some of the tests I need to run can bog down when the network's busy, and they of course are not helping the rest of the network be happy.

    Our network's reasonable, but people should give software folks what they need, not force them to work under the constraints the sales folks do (for example).

    Sure, we have to respect the 'rules' when joining the normal network for email and such, but testing of network applications should almost be on a smaller completely isolated network (to prevent dragging down the whole system when an automated test goes awry).

    Infinite loops don't just happen to stupid people ;). Anyone can get too tired to realise they're sending a billion packets a second because they reversed a conditional or something.

    I know a developer who had to leave one job because the IT folks didn't understand why he couldn't develop windows services without admin equivalence on his local machine (duh).
  • Re:CLARIFICATION (Score:2, Interesting)

    by ChimChim ( 54048 ) on Wednesday November 27, 2002 @12:44PM (#4768033) Homepage
    Yes, I'm not the wizard of words (or apparently math ;) this morning, am I?

    My main reason for posting was to appease my instinctual reaction to the (somewhat intuitive) mistake sometimes made that having twice the stuff makes it twice as good/reliable, etc. That holds true for availability (10-fold, in fact), but you'll get less in the case of reliability, and manageability is also a concern, since you'll have to constantly check the backup network (if it's not in active use, failures are harder to find, or predict for that matter). Also, failures aren't always randomly dispersed throughout the network, as the model might imply. You have to figure out how much failure each part of the network can sustain.

    So, throwing more hardware, developers, or whatever at the problem isn't a real solution. Figuring out what was wrong in the first place will let them spend their money more wisely, rather than letting all that hardware go to waste, doing nothing. They could possibly get all the redundancy they want with less than twice the hardware and maybe even increase performance of the network during regular usage.
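
    A quick back-of-the-envelope of where that 10-fold availability gain comes from (made-up numbers, independent failures assumed):

        # Duplication divides unavailability; it doesn't double availability.
        a = 0.90                   # hypothetical single-network availability
        dup = 1 - (1 - a) ** 2     # two in parallel, independent failures
        print(f"one network: {a:.0%} available ({1 - a:.0%} downtime)")
        print(f"duplicated:  {dup:.0%} available ({1 - dup:.0%} downtime)")
        # Downtime falls 10% -> 1%, a 10x gain in availability terms. But
        # reliability (time to first failure) and manageability don't improve
        # the same way, and an idle standby hides its faults until needed.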

    ok, i've totally overspent my $0.02.
  • Re:Fix it Scotty ! (Score:1, Interesting)

    by Anonymous Coward on Wednesday November 27, 2002 @12:45PM (#4768045)
    I wish there were more users like you. I work for Cisco, and MANY times have to ask 'who designed this for you?' or 'can I speak with your network administrator?' only to find out that the customer is the one that came up with the bad design in the first place. When I try to suggest that they screwed up on their topology and we should change it, then either they don't want to, don't have a maintenance window, or have some other lame excuse.

    Please RTFM *BEFORE* buying all the gear, when you are in the design phase. If you bought the wrong gear, or if you didn't buy enough memory or whatever, there's not a whole lot I can do to save you.

    Please RTFM *BEFORE* connecting it up. We (yes the people who you are on the phone with) write sample configs up for a reason. We set them up in the lab and verify that they work BEFORE they are submitted for the website.
  • A Case History (Score:3, Interesting)

    by Baldrson ( 78598 ) on Wednesday November 27, 2002 @12:45PM (#4768048) Homepage Journal
    A major corporation wanted to go paperless. They had all sorts of IDEF graphs [idefine.com] and stuff like that to go with. I was frightened for them and suggested that maybe a better route was to start by just going along the paper trails and, instead of transporting paper, transport physical digital media -- sneaker-net -- to workstations where digital images of the mail could be browsed. Then after they got that down they could put into place an ISDN network to the phone company which would allow them to go from sneaker-net to a network maintained by TPC. If TPC's ISDN support fell apart they could fall back to sneaker-net with physical digital media. Only after they had such a fail-safe "network" in place -- and deliberately fell back on it periodically and randomly to make it robust -- would the IDEF graphs start being generated from the actual flow of images/documents. By then of course there would be a general attitude toward networks and computers that is quite different from that of the culture that typically surrounds going paperless.

    Unfortunately more 'radical' minds prevailed and the project was eventually abandoned after $100M.

  • In my opinion... (Score:2, Interesting)

    by freebase ( 83667 ) on Wednesday November 27, 2002 @12:46PM (#4768052)
    First, I don't have all the details of what happened, nor do I have any idea of what the network looked like prior to the outage. However, I have a general design philosophy based on my experience with teaching hospitals and telco networks.

    The concept is that of "a network of networks", much like Cisco's DCN solution for telco operators. This is a series of interconnected networks that are capable of standing alone in an emergency. These networks are normally oriented around particular application/traffic/usage patterns. An example would be a research network for research workstations, a lab network, a cardiac care network, and so on.

    All of these networks could exist as separate layer 2 VLANs trunked back to the facility data center, if bandwidth is available. Within the data center, layer 3 routing could handle traffic that needed to cross between these networks. The data center would also have separate networks for each application group, so that applications generally aren't able to interfere with each other.

    Obviously this is an overly broad synopsis and leaves out many details; it is also just as obvious that I'm talking about a campus environment here and not a WAN, where the same theory will work, but with different implementation.
  • Re:Um.. (Score:5, Interesting)

    by Anonymous Coward on Wednesday November 27, 2002 @12:47PM (#4768063)
    They're called "accountants". My father is a netadmin by trade, and the thing that stresses him most about his job is how, quote, "fucking bean counters" make the purchasing decisions for him.

    Example: They want to replace Netware fileservers (they've got something around four years of uptime, and that's including having their RAIDs expanded; all that's going to stop them is a man with a sledgehammer) with Windows ones. While Windows servers, if configured correctly, are really stable, they are not stable enough for truly mission-critical jobs (in this case, dealing with insurance companies and medical evacuation; time is not just money, it's life), yet the idiots in charge have been suckered by Microsoft's marketing.

    In this case, staying with Netware has saved lives.

    Accountants have too much control. They do not understand that if something is vital, you do NOT give it anything less than the very best money can buy. So it'll cut into your profit margins. So what? At least you will still have the margins.
  • by jerde ( 23294 ) on Wednesday November 27, 2002 @01:22PM (#4768383) Journal
    Well, mostly transparent to end stations.

    Some workstations bring up their Ethernet link in software, and then try to use the port right away to, for instance, obtain a DHCP lease.

    Spanning tree starts doing its work as soon as it sees an Ethernet link, so there's a delay between the time the link comes up and when traffic starts to pass.
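
    For scale, here's the classic 802.1D wait with default timers sketched out (a simplification; features like Cisco's PortFast exist precisely to skip it on end-station ports):

        # Why "link up" is not "traffic passes" on an STP-enabled port.
        stages = [("listening", 15), ("learning", 15)]  # default forward delay: 15s each

        elapsed = 0
        print("t= 0s  link up; port not yet forwarding")
        for state, seconds in stages:
            elapsed += seconds
            print(f"t={elapsed}s  finished {state}")
        print(f"t={elapsed}s  forwarding -- any DHCP client that gave up earlier lost")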

    Apple's DHCP implementation was bitten by this [apple.com] on some of their machines, affecting the startup of the AppleTalk stack, which, unlike DHCP, will not retry its initial auto-configuration and address discovery.

    I've always been skeptical of "intelligence" added to layers below 3. There are always unforeseen interactions and consequences to ANY variance from a set standard.

    - Peter

  • Re:Spanning tree (Score:2, Interesting)

    by Cramer ( 69040 ) on Wednesday November 27, 2002 @05:17PM (#4770317) Homepage
    That's handled by "partitioning" on the same switch. Most switches are smart enough to tell when they've been plugged into themselves. And even if they aren't, broadcast suppression will catch such setups really well -- all it takes is one broadcast packet to flood both ports. STP prevents loops between switches; in this case, that'd be plugging ports from multiple switches into the same hub.

    There's an even easier way to fix the problem in your example... don't give the idiots access to multiple ports in the same network. :-)

    And I would submit it's not very wise to create a city-sized switched Ethernet network.
  • Re:Hospital Systems (Score:2, Interesting)

    by lucifuge31337 ( 529072 ) <daryl@in t r o s p e c t . n et> on Friday November 29, 2002 @12:25PM (#4780025) Homepage
    They aren't over-educated for a damn thing. They are under-educated for everything. Don't give out credit where it's not deserved.

    CS programs are supposed to teach both the theory AND the operations of current technology. This should allow CS grads to quickly learn new technology incrementally. That's the point of these programs.

    People coming out of tech schools are fine, but they often have no idea how things REALLY work (just "if A happens then I'm supposed to do B" type of knowledge).

    OK...I'm pretty bored with the thread now.
