Blow-by-Blow Account of the OSDN Outage

The first hint that all was not well came at about 2 a.m. on Saturday, US eastern time, in the form of slow-loading pages. By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem. The network operations people were paged, but did not respond. Uh-oh.

Our network operations staff was shorthanded; one of our most knowledgeable people had quit recently to go into business with a friend and had not yet been replaced. Another was in the hospital, ill and unreachable. A third's cell phone was on the kitchen counter, out of earshot of the bedroom, and the fourth one's cell phone battery had fallen out. It was a frustrating comedy of errors, and an unusual one. Our netops staff is typically "on the bounce" 24/7.

Dave Olszewski, an OSDN programmer who is not technically part of our netops staff and is not trained in our equipment setup, happened to be on IRC at the time. He doesn't live far from the Exodus facility in Waltham, MA, where our server cage lives, so he went there immediately. Kurt Gray, lead programmer, whom we dragged out of bed, was not far behind. Hemos and others were awake by then, growing frantic as we found that not only Slashdot, but also NewsForge, freshmeat, OSDN.com, ThinkGeek, and QuestionExchange were down, along with our old -- but still popular -- MediaBuilder and AnimationFactory sites. Arrgh!

This is Kurt's "on the scene" report from Exodus:

Walk into our cage at Exodus and it seems harmless enough, but try to learn what everything is doing and where all the wires are going in less than an hour and you could go insane. You're standing in a nice, clean, uncomfortably air-conditioned facility with 150 of VA's FullOn and various other servers humming away. Greeting you at the door is "Big Gay Al," our Cisco 6509, which contains two redundant router modules: Kyle and Stan. If Stan dies, Kyle takes over and vice-versa. Across the cage are two Arrowpoint CS800 load balancing switches: one is racked and idle (as a hot spare) and the other is live and balancing the load for most of our OSDN web sites. Between the Cisco 6509 and the Arrowpoint is a bridging FreeBSD firewall using ipfw rules to block stuff like ping, basically just to drive everyone nuts.

"I can't ping your site!"

"Yeah, we know."

Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documentation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or you've had the joyful opportunity of spending a night trying to trace through it all under the pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you.

At this point, if you know anything about networking, you'll demand an explanation for why we're using each piece of equipment in the cage and not a WhizBang 9000 SuperRouter like the one you've been using flawlessly that even washes your dishes for you and makes food taste better too... I can only tell you that I'm not the networking design person here; I didn't choose this equipment or configure it, but I'm told it's very good hardware as long as you know what you're doing. But as CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."

So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.

"Did you reboot the firewall?" I asked Dave.

"I rebooted everything," he said. "I think's it's the Cisco."

So we console into the Cisco 6509. What a mess. Neither of us understands how this switch was configured or what it is trying to do. We don't fully understand why you can get a console connection on Stan but not Kyle (turns out the standby module doesn't have an active console; that's normal).

Headshaking all around. Meanwhile, about 11:40 a.m. Yazz Atlas woke up and got his cell phone reunited with its battery. He picked up his voice mail messages, tossed on clothes, and hustled over to Exodus.

Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords." The op who was supposed to be on duty (the one whose phone was out of hearing) was still nowhere to be found. They called their hospitalized coworker and got the Cisco passwords.

But, says Yazz, "Since the Cisco was rebooted there were no logs to look at. We could ping something on the inside but not everything. On some VLANs we could ping the gateway and others not. The outside world could ping one of the IPs the 6509 handles but not the other. From the inside we could not ping the IP that the outside world could ping. We could ping the one that they couldn't...very frustrating..."

Kurt again:

Several hours of this sort of network debugging went on until 3:00 AM Sunday. By then we had called Cisco for help. They couldn't help us until they saw the switch config and got a chance to review it. We were spent. We had to go to bed and stay down for the night.

Next morning we're back at Exodus and the situation hasn't changed -- our network is unreachable to the outside world. I was hoping that during the wee hours of the morning the Cisco 6509 had become sentient and fixed its own configuration, or perhaps a friendly hacker had cracked into it and fixed it for us, or perhaps ball lightning would travel down a drain spout and shock our cage back to life like those heart paddles paramedics use... "It's a miracle!" No such luck.

So I called Cisco tech support. I wish I had done this sooner. I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician... doesn't Cisco realize how shocking this is to technical people, to actually be able to talk to qualified technicians immediately who say things other than, "Well, it works on my computer here..."? Do they not know that tech support phone numbers are supposed to be 900 numbers that require you to enter your personal information and product license number, then forward you to unthinking robots who put you on hold for hours, then drop your call to the Los Angeles Bus Authority switchboard... does Cisco not understand that if you do not put people on hold for at least 10 minutes they might pass out in shock for being able to talk to a human too soon? Apparently not.

So I asked the Cisco technician, Scott, to telnet into our switch and take a look at the config. I figured he'd balk and say, "No I can't do that," because of course this is a tech support number I called so he's going to tell me to give the phone to my mommy if she's there and ask her to log into the switch because, since I don't have a lot of experience with IOS, I must be some kind of idiot to even call tech support without knowing what my HSRP configuration is on VLAN 4. Instead he says, "OK, what's the login password?" I can't believe this... I must have dialed the wrong number, he's not going to just go into our switch and sort this out for me right here and now, is he?

So he's in the switch and he's disgusted and horrified by how we have it configured, and I'm sure he's right. So I ask him, "Well, can you change all that?" I figure he'd say, "No, this is your equipment, you fix it yourself," but he doesn't, he says, "Sure, what's the config password?" You gotta be kidding me, I must have dialed the wrong number here... this cannot be a tech support line... you can't actually get a tech support rep on a toll-free number simply to log in and fix your router setup while you whine at him on the phone... this is not real.

So he's in the switch config and he's having a great time pointing out everything some of our people warned us about months ago. He tells me this is wrong, we shouldn't be doing this or that... "Well, then change it if you don't mind," I tell him. "Switch broke. Me dumb. You fix." ...so at one moment Scott wanted to undo some changes. He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird.

Ok, that's all fine, but Scott is still freaked out about how we have the switch configured. Soon I get a call from Barnaby, another hot shot Cisco tech rep. He just logged into our switch and he's horrified too. He wants to walk me through a total switch upgrade and cleanup right now. "Not tonight," I tell him, "I'm burnt and I need to consult some network people over here before we mess with this any further."

The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers. Instead of getting an answer from Exodus and running to Cisco with it, and then back again, he got Cisco and Exodus engineers to talk directly to each other and work it out. He conferenced an Exodus network engineer to Barnaby at Cisco and, Kurt says, "they talked alien code about VLANs, standby IPs, HSRP, multihoming, etc. etc., and they came to an agreement: our switch config was a mess... but at least Barnaby knew what the settings were supposed to be and an Exodus engineer agreed with him."

Before moving on to the (short) Tuesday outage, here are a few more notes from Yazz:

The one card going bad wouldn't have been such a big deal if the config in both cards had been set up correctly. It was meant to flop over to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...

Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. That's what the Cisco folks said, basically, and the Exodus people are no longer supporting this type of Cisco in their setups. Half the VLANs were only stored on one unit and the other half of them on the other. So when one died it only knew half of the full setup and couldn't route things correctly, since the VLANs it wanted weren't there... Fun!!!
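
In other words, for the failover to do its job, every VLAN interface has to be defined on both units, with HSRP presenting one shared gateway address per VLAN. A minimal sketch of what that looks like in IOS (the VLAN number and addresses here are invented for illustration; the real config was considerably more involved):

! primary unit -- illustrative addresses only
interface Vlan4
 ip address 10.0.4.2 255.255.255.0
 standby 4 ip 10.0.4.1
 standby 4 priority 110
 standby 4 preempt
! standby unit -- same VLAN and virtual IP, lower priority
interface Vlan4
 ip address 10.0.4.3 255.255.255.0
 standby 4 ip 10.0.4.1
 standby 4 priority 100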

Tuesday was router reconfig day. It was originally only supposed to cause "about five minutes" of downtime, so it didn't seem worth posting any kind of notice that it was going to happen. Why the middle of the day instead of a low-traffic post-midnight time? Because this way, if there was any trouble lots of people at Exodus and Cisco would be awake and around to help. And it was a good thing this choice was made. Kurt picks up the story:
Tuesday 11:00 a.m. we're back in the cage. Barnaby is logged into our switch while he's talking to me on my cell phone (which disconnects every 5 minutes just to make my day more challenging), helping us by upgrading the Cisco 6509 firmware; then he's going to clean up the config. First step was getting the firmware patches onto a TFTP server near the switch (had to be less than 3 hops from the switch, TFTP doesn't work over longer hops). Yazz took care of that. From there Barnaby patched the firmware, had me reboot the switch, and we should be down for just 5 minutes. Unfortunately 5 minutes turned into 2 hours.
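
(For anyone who hasn't done this before: once the image is sitting on a reachable TFTP server, the copy itself is a one-liner at the switch prompt, roughly along these lines. The exact procedure on a 6509 varies with the supervisor and software, so treat this as a sketch rather than the commands Barnaby actually used.)

Switch# copy tftp flash
(IOS then prompts for the TFTP server address and the image filename)
Switch# reload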

After the switch reboot part of our network was unreachable again, much like Saturday's episode, only this time with a Cisco rep on the phone helping us work it out. Again we started tracing cables all over the cage, pinging every corner of the matrix. Barnaby got an Arrowpoint tech rep, Jim, on the line and into our Arrowpoint. But this is tech support, Jim isn't just going to log into our Arrowpoint and debug it for us, right? Wrong, this is Cisco tech support: Jim logs into our Arrowpoint and works with Barnaby to trace packets and debug our network.

For a while we put a cross-over cable in place of the firewall just to be sure the firewall box wasn't jamming us. Nope. Didn't help. Barnaby and Jim are mapping hardware addresses to IP addresses to figure out where each packet is going. Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light on its uplink to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"

"Hold on," he says over the phone to me. Then the light goes green, and after a few seconds of routers correcting their spantrees we're back online. Everything is back online. All this time it was this little interface to an ignored switch that none of us bothered to account for. Make a big note about in the network documentation, please.

After we came back online Barnaby went ahead and cleaned up our switch configuration, put things the way they ought to be, and made our connections sane and stable.

This has not been OSDN's finest week. But we thought it was better to give you the full rundown than try to pretend we're perfect. At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be. If nothing else, perhaps this story can help others avoid some of the mistakes we made. We certainly aren't going to make the same ones again! (~.*)
This discussion has been archived. No new comments can be posted.



  • Well deserved though.
  • I think that it's called a lawsuit (or threat thereof). The original post could be considered libel, because if the person in question went somewhere else and had /. on her resume, she might have a really hard time convincing them that she's worth a darn. IANAL, but it's just my thoughts....
  • by Anonymous Coward on Wednesday June 27, 2001 @08:06AM (#125336)
    I have worked in the Cisco TAC for about 2.5 years. Currently on the routing protocols team (EIGRP, OSPF, BGP etc.) Prior to coming here I had never dealt with people this obsessed with getting everything right all the time. Really they drill it into you. Mandatory perpetual training and such.

    Many people who call don't understand how the system works internally so here's a summary: We have cases in 4 groups, priorities 1 through 4, 1 being the most important. The designation of the priority of the case is entirely up to you as a customer. All cases are P3s by default which more or less means they need resolution within 72 hours. If your network is down and you need help right now, today with no waiting we'll elevate to a P2. If you are in a serious network situation like the one described in the article then it's a P1 and literally everything else stops, a bell goes off and everyone crowds around the tech w/ the problem (unless it's a softball case).

    There are TACs all over the world but for English-speaking customers what usually happens is the US TACs roll over to the Australian TACs in the early evening who in turn roll over to Belgium and then back to the US. P1s get worked 24 hours until they're resolved, and if they're not fixed in less than 4 hours it's not so good for us.

    We have to close about 5 of these cases a day which is sometimes cake (I can't ping my interface which is shut down) and sometimes nasty (redistribution 12 times over).

    Also, those little surveys you get every time you work with us (Bingos) are very important. If you'll recall, you can rate us from 1 through 5 in 8 to 10 different categories. Anyone who doesn't maintain an average of at least 4.59 is not long for the TAC, 2 or 3 months tops.

    The pay is actually kind of crap but there's no better place in the world to prep for your CCIE. I don't think anyone views the TAC as a long-term environment. Too much stress honestly.
  • by Hemos ( 2 ) on Wednesday June 27, 2001 @08:40AM (#125338) Homepage Journal
    Very easy - someone typed in the wrong time. We deemed the shared source to be more important, and that was supposed to go up first.

    Not everything is a conspiracy folks.
  • by Roblimo ( 357 ) on Wednesday June 27, 2001 @06:17AM (#125340) Homepage Journal
    No black eye for Exodus, please. Our router config was not a standard one they support. Exodus dude Derek Lam, especially, went way "above and beyond" this last week.

    - Robin
  • See I run a Linux router/firewall, and I do that too. 99% of the time it works. X fubared the display? No problem, reboot and it works again. ipfilter stopped forwarding NAT packets for some reason? No problem, restart it and it works again. etc.
  • Not anymore, because I've come to determine that X sucks, which is why it's now just a firewall/router box (headless). It used to be an attempt at a Linux workstation before I gave that up...
  • I also agree... This is what "hacking" is really about, solving complex problems through ingenuity and diligence.

  • You put our favorite news engine in the middle of a routing mess that the network engineers had been warning you about for months?

    What were you thinking?

    You must be able to find a nice, comfortable colocation site somewhere.

  • ..and when our qualified personnel arrived, we discovered that she wasn't actually as qualified as we had hoped. Then she quit..

    ..and then she was erased from the latest "official" version of the story. What the fuck is this? This isn't a "blow-by-blow account", it's a service pack to fix the "bugs" in your last account of what happened!

    Go on! Mod me down to -1 again! You'll have to do it a few times before I go below the "post-at-2" threshold!!

  • by Kurt Gray ( 935 ) on Wednesday June 27, 2001 @07:31AM (#125351) Homepage Journal
    I'm not sure what's happening with moderation, but since so many people want to know: One of our netops quit suddenly Sunday without any explanation. I assume she was put off by being called in on a weekend and being asked to stay late until it was fixed. I don't know, but these things happen so we deal with it. One thing you don't want to do is publicly flame someone who still has your root passwords (although I trust this particular person with our root still); besides, we're not mad at her, wish her well, sorry things didn't work out.
  • by Masem ( 1171 ) on Wednesday June 27, 2001 @06:12AM (#125353)
    ...I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician...Instead he says, "OK, what's the login password?" .... he says, "Sure, what's the config password?"...

    Was anyone else waiting for the "*clickity-click* Wow, it looks like your entire root directory was deleted!" punchline? :-)

  • You must clearly have had no contact whatsoever with Oracle, or you must be working for them.

    Oracle are a big company, and vary hugely in the support they give you. I've had situations where I've been given the runaround, like you. Getting passed from extension to extension, explaining my problem over and over again, "oh, umm, we don't do that stuff here, call this number..." and finding out that Bob's on holiday and his secretary has no idea who else I could speak to...

    I've also had situations where Oracle have said our engineers aren't sleeping until this gets fixed, and a few hours later there's a motorcycle courier at my door with a gold disc containing a brand new build of Oracle with the bug fixed. I've had Oracle techs ssh into my servers, I've had them come to the data centre with mysterious CDs containing Oracle software that they don't let outsiders have, and that they erase from your machine once they're done.

    Helps to have (or at least have access to) a high-end support contract, tho'. If you're some kid who downloaded 9i onto his Red Hat box, forget it.

  • Uh... TFTP uses UDP, which is a connectionless protocol; you can of course transfer files over more hops, but keep in mind, the more routers, etc. you have in the middle, the more chance of a packet being dropped, and one packet can mean quite a bit when you're transferring a new IOS image to your Cisco ;)

    Now it's been quite some time since I've looked at the TFTP RFC but I'm pretty damn sure it has the capability to request a block be retransmitted in the case of a timeout (packet loss). In fact, I'm sure of it; during the upgrade a few '.'s were noticed amongst a ton of '!'s and the checksum still worked out.

  • by tzanger ( 1575 ) on Wednesday June 27, 2001 @07:20AM (#125357) Homepage

    Was this configuration ever tested?! It sounds like it was put together, prayed over and sent out into the world.

    it would have been simple to test too... pull out one of the uplinks... then the other... now try pulling out some of the webservers... and so on.

  • by tzanger ( 1575 ) on Wednesday June 27, 2001 @07:05AM (#125358) Homepage

    By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem.

    Reboot the database?? WTF? You just proved my point as to why MySQL is NOT ready for primetime. Reboot the fscking database??

    So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.

    Guys, this isn't Windows -- Rebooting is an absolute last resort, and if it works then you have discovered a problem, either in hardware or software, and it needs to be fixed, not just an "oh well, a reboot fixed it, life goes on." Bastions of professionalism you're not.

    I don't normally flame people for this kind of thing but the Slashdot crew are especially keen on bashing Windows, yet you resort to their exact tactics whenever a problem comes up.

    Reboot the database?? I still can't believe I read that. Sorry.

    Cisco Systems have some wonderful systems -- Hell I just recently found out about their stack trace analyzer... feed it a "sh stack" and it emails you back a list of IOS and/or hardware bugs which likely caused the crash. That is just plain old SCHWEEEET. Or being able to read their memory mappings to find out what is causing a bus crash... Ideal. You don't just randomly reboot the damn shit to try and get it to work. If it isn't working something is causing it. Embedded systems are generally pretty good at throwing up the red flags; you just need to look for them (logs, stack traces, extensive use of the debugging facilities...) Use the tools at hand instead of the big red button!

    First step was getting the firmware patches onto a TFTP server near the switch (had to be less than 3 hops from the switch, TFTP doesn't work over longer hops).

    Unless this is something specific to the IOS or router, that's bullshit. I just upgraded 5 AS5248s to IOS 12.1(9) with a TFTP server that is 8 hops away. I'm not aware of any TTL issues with TFTP.

    Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light on its uplink to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"

    You mention that your network documentation is shitty -- I sure as hell hope you'll push to have it upgraded and maintained with a high degree of readability. Even complex systems do not have to be undocumented just because they're complex. Use pictures, use words. I haven't found anything in IT which cannot be explained by a combination of both. And throw in a glossary for the non-techies like yourself who are called upon to fix it. :-)

    Don't get me wrong; I'm glad you're back up. But this could have been prevented. Very easily from the sounds of it. I hope you did fire your cisco admin; it sounds like s/he didn't have a clue and was too terrified of losing his/her job that s/he didn't ask for help. Cisco has mailing lists, tons of documentation and there are many IRC channels to ask for help.

  • by tzanger ( 1575 ) on Wednesday June 27, 2001 @07:16AM (#125359) Homepage

    Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.

    Up until recently you had no choice but to telnet to Cisco equipment. I came up with a quick solution: deny telnet from anywhere but a same-segment computer (in our case, it's our RADIUS authentication box). Now ssh to the server and telnet from there to the NAS. Problem solved. :-)
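
    For the record, on IOS that restriction is just an access-class applied to the vty lines. A quick sketch, with a made-up address standing in for the RADIUS box:

    ! illustrative only -- allow telnet to the vtys from one trusted host
    access-list 10 permit host 192.168.1.5
    line vty 0 4
     access-class 10 in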

    Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!

    While I usually agree, sometimes it is necessary to do a quick check. Even with the number of blackhats out there, the chances of them doing anything significant (or anything at all) in the 2-5 minutes you have the firewall out are insignificantly small.

  • I don't know about Cisco's daytime support, but I can confirm that if you call 'em at three or four in the morning, they're incredibly helpful. I had to pull an all-nighter to get a site live a couple of months ago, and they spent a number of hours on the phone with us figuring out why the traffic wasn't routing (turns out this particular firewall didn't like doing NAT when the internal IP address had a number in the 60's -- don't know why). Very polite, knowledgeable, and willing to help -- certainly the high point of that particular hellish project.
  • by Pii ( 1955 ) <<gro.rebasthgil> <ta> <idej>> on Wednesday June 27, 2001 @07:06AM (#125362) Journal

    I can confirm this. I've been a network consultant for almost a decade, primarily as a Cisco router/switch jock. I've dealt with the TAC (Technical Assistance Center) too many times to count.

    Hold times can vary, depending on time of day, but are never as bad as the stories from other companies. In most cases, you are on the phone with a real, live engineer within 5 minutes.

    90% of the time, the engineer you are transferred to will be able to get your problem corrected. On the few occasions where they have not been able to help me, Cisco has moved mountains to get the right people involved. I had an issue with Serial SNA - DLSW+ encapsulation last year that was escalated to the point where the guy that wrote that portion of the code for IOS was on the phone, and was prepared to come to my client's site (True, they had purchased about $8M in hardware...).

    You do, typically, have to have a Smartnet contract, but as other posters have pointed out, if the problem is not hardware related, they will generally help you straighten out your configurations even without the contract.

    A lot of people like to make comparisons between Cisco and Microsoft. Anyone who has dealt with the two will be quick to dispel any similarities. Cisco is a first-rate organization, with first-rate support, and I've made a career out of working with their products.

  • I have been TAC'd "around the world" literally with Cisco support; One TAC case lasted 32 hours, all on the phone. We went from California, to the East Coast, to Brussels/Belgium, to Egypt, to Asia then Asia-Pacific and back to California. We had several problems that basically caused us to create a new core network from other pieces of equipment, tear down and rebuild the original router from the chassis up after a bad power supply ate the old one, and each and every card in it. This was a few years ago, before everyone could afford or manage completely redundant network infrastructure, and things like 2-hour turnaround on hardware were supposed to alleviate things like this. The problem was, some of the cards would pass first-level diags, but not run for long. Each part got there in less than 2 hours tho! It was one of those 'one in a million' cases, but the rep on the other end of the phone was cooperative the whole time.

    That said, I've also had low priority cases where they don't respond for weeks; It's almost to the point at times that anything I've opened gets opened at Medium priority (business impact) or higher.
  • True. Even at the best of the best, there will be better and "less better" people. And anyone can have an off day.

    OTOH, I also had the experience of a TAC rep spending 2 hours on the phone with a competitor's tech support line, explaining to them why their config wasn't working. He was right, too.

    A good long-term sales tactic, though: guess whose product I specified the next time.

    I do wonder what will happen to the quality level at Cisco TAC with the recent layoffs, though. The first sign of impending doom at both WordPerfect and Novell was when the tech support quality suddenly headed down the tubes.

    sPh
  • by sphealey ( 2855 ) on Wednesday June 27, 2001 @05:58AM (#125366)
    Yes, if you have a SmartNet contract for that device, it's pretty much true. Cisco, mid-1990's Novell, and Oracle are the only organizations I know of that provide this kind of help. Microsoft "Gold" support plan, anyone? (gag).

    Caveat: Cisco basically does not have first level support (i.e., "Is the router plugged in?" "What's a router?") - you are supposed to have second level knowledge and have completed the first level troubleshooting before you call TAC.

    But - I have been out of the office and had brand-new network techs call Cisco with a problem, and they did help out even then.

    sPh
  • There's a reason why their stuff's pricier than the rest - its overall reliability (except on their low, low end...) AND the support.

    They really ARE this responsive.
  • by Svartalf ( 2997 ) on Wednesday June 27, 2001 @07:28AM (#125368) Homepage
    Actually it isn't. As the other respondent to your comment pointed out, it's possible to determine system type from the ICMP responses. One should also realize that not all exploits use fragmented ICMP attacks. There are all kinds of abuses of ICMP that could conceivably be used to take a system down. It's better to nip any of those in the bud for a high volume site or set of sites.
  • by Oestergaard ( 3005 ) on Wednesday June 27, 2001 @09:54AM (#125369) Homepage
    Luckily, /. is monitored; this historical event will be kept in the monitoring systems for ever and ever ;)
    Go to the monitoring system page. [sysorb.com]
    Click the www.slashdot.org link
    Select services
    This will give you some graphs showing the outage.
  • by Pseudonymus Bosch ( 3479 ) on Wednesday June 27, 2001 @06:59AM (#125370) Homepage
    A much more common experience is to wait on hold for 15-20 minutes, but I have waited on hold as long as an hour with them.

    Well, in this case Slashdot was down. That can explain the instant response.
    __
  • Er, all Nextel phones are made by Motorola.
  • I called cisco one night, while replacing my Nortel BCN with a cisco3662. For some reason, I couldn't get my BGP peer established with Sprint. It was 3am Central US time when I called cisco. I first talked to an individual who stated that they were on a callback. I figured that I would get a call in 45 minutes to an hour. 3 minutes later, my phone rang, and it was a gent from Belgium. He logged into my router for me, found the bgp error immediately, fixed it, and I was on my merry way. He even fixed some of my access-lists for me while he was there.

    Cisco has _the_best_ customer service that I have ever seen. It is good enough that I don't mind paying a bit more for the hardware, because I know that if it breaks, there will always be someone to help me out.

    And, I don't work for cisco :)
  • "If it doesn't work, there is a reason; something is wrong. Rebooting will not fix the problem."

    Not always true. I used to admin JSP-based web servers. My experience is that the Java virtual machines that serve JSP pages have a way of starting to act funny. Stopping and restarting the services fixes the problem.

    If I was ever building a network, I would not allow JSP to be a part of the network for this very reason.

    Then again, if a JSP guru knows what can cause a JSP engine to act wonky, or how to set up a JSP engine so it is stable and doesn't need reboots, please post a follow up describing how to do this.

    - Sam

  • How is calling someone "one of our most knowledgeable people" abusing them?

    Of course the original story [slashdot.org], or, I should say, some of the versions of the original story (how often can you rewrite the original and have it still be the original?) mentioned "...when our qualified personnel arrived, we discovered that she wasn't actually as qualified as we had hoped. Then she quit..." which doesn't sound like someone who had already stopped working there before the troubles started, so I assume that we're talking about 2 different people here, only one of whom was identified one way or the other by sex/gender.

    Quite a ways down in the responses to the aforementioned "original" story is an AC post [slashdot.org] signed Anne Tomlinson that seems to give another perspective on the events that weekend. It's a little ways down the page from another post [slashdot.org] that has some of the different versions of the original story.

  • For all we know they are and we can't browse low enough to see it.
  • Okay, so were there any posts at -2 or lower?
  • "Next time you think you are calling technical support droids, next time you think that you will but put on hold for hours, be careful, you may be placing a call to ... the twilight zone."

    Yike, I say. Yike. Competent tech support does not exist on this earth. What planet is Cisco on, and to what worthy cause can I donate money to see that humans never send a manned mission there and pollute this fascinating superior alien culture?
    --G

  • by szo ( 7842 ) on Wednesday June 27, 2001 @05:55AM (#125386)
    Look at it this way: every time he comes back again, all by itself! Other people die once and for all...

    Szo
  • This topic comes up many times on comp.risks: there's no point in having a backup (server, archive, database, router, etc.) unless you TEST your backup procedure to make sure it works. Pull the plug on the server - does the backup kick in? Kick over the router - does it fail over to the backup? Those who ignore the RISKS digest are doomed to repeat it!

    --Jim
  • We'll just have to wait for an article giving a blow-by-blow account of the Slashdot outage article's outage.
  • This plot is a total rip-off of the Miyazaki Classic: "Hagamaki Ortifunk", or (from its American release) "Whistling in the Dark, with Daisies". Can't Americans think of anything original anymore? Could they ever?

    I'm working on a web site to expose this travesty to the world. I'm sure everyone will be impressed with my esoteric knowledge of this classic of Japanese animation.
  • by TrentC ( 11023 ) on Wednesday June 27, 2001 @08:48AM (#125396) Homepage
    ...Cisco is reporting a projected 40% upswing in earnings for the next quarter, after a favorable review of their technical support personnel on the discussion site Slashdot led to a surge in sales for support contracts.

    "It's the first the the Slashdot effect has been a productive one", said an unnamed Cisco official, pausing briefly to dodge a large bag of cash sailing through a nearby window.

    Jay (=
  • At an old job we had a wee Cisco 1604 router, just doing ISDN for our /24 (at the time ISDN was the only affordable thing in our area)

    I had a problem with something and mailed Cisco. No more than an hour went by and I had email from a real life person in front of me telling me what to do to fix our problem.

    Cisco isn't cheap, but you do get what you pay for.

    grub
  • Nah, if it was really the BOFH he would've changed the routing tables so that all requests to slashdot.org get redirected to goatse.cx, and then told the guys that 'Doppler Static Effect' is their problem and that they need to demagnetize the electrical contacts with their tongues.

    ---
  • Sometimes rebooting will fix the problem. Sometimes you don't have any alternative. Sometimes you can't fix the problem, but you can get things working again (e.g., Windows). And rebooting may be the best (or only) way to do that.

    It is clear that they were out of their depth. It is clear that they didn't know what they were doing. They knew that they didn't know what they were doing. But the experts were unreachable. So they tried something that sometimes works. I really don't see how you can fault them for that. It would, of course, have been better if they had known what their choices and options were, but they didn't.

    I wouldn't have either. Probably most of us wouldn't have.


    Caution: Now approaching the (technological) singularity.
  • Four words: (D)DoS by ping flood.
  • I need to add something here. Of course, if it's unencrypted telnet, it shouldn't be used most of the time. If it's a crisis - then change it to a scrappable password, let the service engineer do his thing, then change it afterwards.

    Preferably an encrypted login should be used, of course, be it ssh, telnet-ssl or whatever.


    --
  • Defense in depth is a good philosophy to have, protecting against configuration mistakes.

    Of course.

    You are also protected if exploit code is run (say via a buffer overflow that changes hosts.deny).

    uh? That sounds pretty damn unlikely. The buffer overflow could just as well open a reverse channel back to the attacker. Of course, you limit the possibilities of the attackers. However, you're now already talking about running services with known vulnerabilities.

    Firewalls can also protect against low-level attacks that don't attack the services/applications themselves.

    That is better done at core-routers.

    When properly configured, firewalls can be invaluable in logging traffic and otherwise keeping out unwanted traffic and IP spoofs -- and can do a far better job than simple packet filtering on a router.

    That is better done by snort, or any other decent IDS.

    I think it's pretty poor form to call someone else a dimwit when you're lacking a lot of info yourself. There's a reason that a firewall is industry-wide best practice for an Internet site or user network, and it's not because we're all dimwits

    I regularly call those who think running firewalls is the be-all and end-all of security dimwits. Unplugging a firewall on a network you know isn't exactly a horrible thing to do.

    A firewall is a good thing to have when you've got a network you don't have time to audit, and that doesn't have people to audit it on a regular basis. It's a good thing to have when you've got servers which you don't have any possibility of patching or upgrading -- but that need to be running some (non-vulnerable) services to the internet.

    Of course, you could do lots of these things with NAT devices. (Which of course isn't a perfect solution either.)

    Blargh, I could rant on forever.
    --
  • by arcade ( 16638 ) on Wednesday June 27, 2001 @11:08AM (#125409) Homepage
    Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.

    Bah, you're talking without knowing the parameters. For all you know, they could've enabled the telnet access on the outbound interface specifically for the checking/cisco rep, disabling it afterwards.

    Secondly -- if I remember correctly you can have pretty damn long passwords on Cisco equipment. We do not know the length of the password, but it's highly probable that the password is 10+ characters. A brute-force attack is pretty damn difficult when you have to check 64^10 possibilities. According to my bc:

    arcade@lux:~$ echo 64^10 | bc
    1152921504606846976

    Now, that is a pretty impressive number of queries you've got to make to exhaust that pwd-space. To be quite frank -- I don't see the problem.

    Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!

    Oh, yes of course. If you don't have a firewall You are phooked!!

    Ehh? Excuse me? Why the fsck does a properly configured server farm need firewalls _at all_? Please, enlighten us with your wisdom, oh dimwit.

    Firewalls _are not needed_ if you're not running services that _should not be running_ on servers for the internet.
    --
  • They did change the IP back. The switch over was temporary, to get an announcement up ... and that was outside the Exodus cage. Fortunately they did have 1 (out of 3) authoritative DNS servers outside of there, so they could get people over to the announcement ... eventually, as cache TTLs expired.

    It's already bad enough to have a 24-hour expiration on the A-record. But you don't anticipate these outages, so 1D is fairly common practice (even longer in some places trying to reduce their DNS load). But the real mistake was putting a 24-hour expiration on the temporary IP. Basically that says "as soon as I change this, everyone who cached this temporary IP address is going to have to wait a day from when they first saw the page, before they can get their /. fix (or other OSDN stuff)". What? Did someone actually think they were going to change the IP back 24 hours BEFORE the sites were back up? The temporary A-record should have had a TTL of less than about 30 minutes. I'd have put in 10 minutes if it were me. But then, if I were there, I'd have also been doing the Cisco stuff and actually tested the failover configuration.
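
    For reference, the TTL is just the per-record field in a BIND zone file, so a temporary record like this hypothetical one would age out of caches within ten minutes:

    www   600   IN   A   192.0.2.10   ; made-up name and address, 600-second TTL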

    I do recommend:

    • Having at least 4, and maybe even 6, authoritative DNS servers, all diversely located (I'm sure they can get some located over at VA Linux).
    • Develop specific procedures to handle failures for each piece of equipment.
    • Print the procedures on paper and keep a copy at the cage, in the office, and at a senior manager's home at minimum.
    • Hire an outside consultant in each area to review the procedures to make sure they make sense to an outsider in the event you might need to go outside to solve the issues.
    • Test the procedures and configurations by scheduling "failures" to see if the major points work as intended.

    These are the kinds of things system and network administrators are supposed to do. Programmers tend to hate that kind of work, so that's why there are separate job descriptions. Just because a good programmer can install and configure a server doesn't mean that just doing that is all that needs to be done. Businesses run smoothly when people know what they are supposed to do. And in the exceptional circumstances, they're doing things they don't routinely do, and it is essential to not only have those things written down, but also make sure they do work, and can be found even in a power failure.

  • In this case, "Full Disclosure" means, "A good yarn."

    --
  • Not a libel suit, but rather /. is afraid of all the angry parents who would claim that his post would make little girls avoid taking classes involving computers, networks, etc., in much the same way that Barbie convinces girls to avoid math.

    --
  • by Midnight Thunder ( 17205 ) on Wednesday June 27, 2001 @06:07AM (#125413) Homepage Journal
    Maybe that's the next site OSDN should come up with. The idea is that anyone who has had a major problem with their network or computers and solved the problem, could post their write up to help others who find themselves in such a situation.

    I definitely enjoyed reading this article and I am sure that it will be bookmarked by a fair few techie minded network admins, just in case.
  • I've had this kind of support as an end user with a Cisco 804 ISDN router. The same quality of support that we were getting on our support contract with a 7206, at my previous job.

    The main reason that they're so prompt, is that they have a global network for phone support. When you call them, your call gets transferred to a technician who has just arrived at work (ie, if you're in the US and call at 3am, you'll probably end up speaking to a technician in central or western Asia).
  • Heh,

    I'm reminded of an intrusion team story about one such team that faked a package from an OS vendor (letterhead, box, etc.) containing a "patch." The admins looked at the box, assumed the obvious, and installed the patch which, while fixing an actual problem, also backdoored their system.

    I could see running a remote exploit to crash your box, sending you mail about it (faked, of course) and then sending you a "patch" to "fix" the exploit (while adding some of my own...).

    Be careful, there are some tricky bastards around with way too much time on their hands. Check those MD5 sums...

  • The logs are held in a ring buffer in RAM. What you are supposed to do is configure the router/switch with the address of a syslogd server which will handle the logs better.
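    A couple of lines of IOS config take care of that; for example (the syslog server address here is made up):

    logging 10.0.0.20
    logging trap informational
    logging buffered 16384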
  • by wesmills ( 18791 ) on Wednesday June 27, 2001 @06:52PM (#125417) Homepage
    The only problem is, it isn't a troll. [Full disclosure: I guess it's time I reveal who I work for on /. for the first time, like anyone cares ... 'tis Microsoft] If you are a Premier or, God help you, Alliance customer, you will get the red carpet treatment in almost every product support division. Even in departments that have combined Premier & Professional (per-incident) support, there's still the unwritten rule of "Pro calls don't get to push for complicated stuff as hard." First question out of a lot of people's mouths is "Pre or Pro?"

    Of course, it depends greatly on who you are talking to. The platforms team does have a huge slant toward NT/2000 because that's what they support and allegedly like. Those of us in Exchange support (I'll leave it to you to figure out what part of Exch. support I'm in) handle calls where Unix servers are relays, PIX firewalls sit between systems and load balancers continually send packets off into the woods. If you *don't* know non-Microsoft stuff, aren't prepared to acknowledge that non-MS works and works well, or just can't handle the idea of public standards, you are fucked in that group.

    It all comes down to who you get on the phone. If you don't like who you are dealing with, ask to speak with their manager or technical lead. Get it straightened out with them or request another support tech. You're paying for it, get what you are paying for.

    (As always, my comments are my own and my employer doesn't take any responsibility for them. Like they would want to anyway.)

    ---

  • "First step was getting the firmware patches onto a TFTP server near the switch (had to be less 3 hops from the switch, TFTP doesn't work over longer hops)."

    I've tftp'd images to Ciscos and Ascends across the Internet (many hops) without problems. It's not smart, because if you lose your path to the server you're screwed, but it does work.
  • by JabberWokky ( 19442 ) <slashdot.com@timewarp.org> on Wednesday June 27, 2001 @09:36AM (#125422) Homepage Journal
    Can anyone familiar with Cisco, besides people working for /., confirm this?

    I don't normally swear, but if someone asks me if Cisco support is good, I have to reply: "Abso-fucking-lutely". They are easily the tightest organization out there, bar none. I don't think anyone -- UPS, the military, Wall Street -- runs as good an operation as they do.

    And I've sat with two engineers from 1:00am through to 11:00am as they fixed my small gateway to an ISP, not a big ticket item. At one point, they did an engineer transfer, connecting me to a different part of the world, and spent thirty minutes overlapped, with the engineers working together to make sure that the new engineer knew what the first had tried. As it turned out, the firmware storage was flaky, and the config corrupted itself semi-randomly.

    Years later, I watched Cisco do the exact same thing - only this time, they correctly identified that the problem wasn't with them, but with some Bay routing equipment, *and* they told us the exact commands to fix it (I was an outside consultant just watching, but I believe they even offered to telnet in and fix it themselves).

    So, yes. Cisco is the only brand I will buy, no matter how expensive they are. Think of the extra expense as insurance. You *may* not need it, but it sure pays for itself if you do.

    --
    Evan

  • Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable.

    ...unless the risk of being compromised within that short period is outweighed by the information you will gain by testing around your firewall. It is a simple trade-off.
  • by Ralph Wiggam ( 22354 ) on Wednesday June 27, 2001 @07:26AM (#125426) Homepage
    I think the Anne Tomlinson post was a particularly brilliant troll.

    A quick Google search for "Anne Tomlinson" returns an orchestra conductor and someone in a retirement community.

    If it was a real post, CmdrTaco probably would have ignored it. His good humored response makes me think it was a troll.

    Is there any evidence that it was real?

    -B
  • by skullY ( 23384 ) on Wednesday June 27, 2001 @10:32AM (#125427) Homepage
    I have to agree with Taco, if they gave this kind of service down at the DMV, they'd be picking up passed out folks left and right.
    From this day forward all slashdot editors shall be known as Taco, regardless of what their chosen moniker is. This measure will simplify things drastically. No longer will posters have to do the arduous task of scrolling back to the top of a page to see which editor posted a story. After all, most of the editors don't bother to check what they're posting, why should the readers?
  • by WNight ( 23683 ) on Wednesday June 27, 2001 @09:16AM (#125428) Homepage
    http://www.bmug.org/news/articles/MSvsPF.html

    I beg to differ.

    That article details calling the 900 line, but even with support contracts, most MS tech support reps toe the company line in a distressing fashion.

    "Unplug all the unix servers, that'll fix it"
    "Upgrade everything to Win2k Adv Serv, that'll fix it"
    "Upgrade to SQL Server (from Oracle), that'll fix it."

    They seem to have no ability to distinguish which network components could be involved in a problem and are unwilling to accept that you've already localized the problem.

    Case in point, there was a problem where two WinNT boxes wouldn't see each other. They both had IPs, they could both ping everything else. They were connected via a 100mbps switch.

    We made sure each properly had an IP, that it could reach other machines, that the switch worked, and then swapped ports with two machines that were working just fine. We also tried isolating these two machines on their own switch, to avoid potential IP conflicts.

    When we called the support number we honestly described the situation to the tech. He asked what else was on the network. We explained that it was in a different IP range, but on the same switches as a bunch of Linux machines, an OpenBSD box (the firewall for the desktop machines), and a couple of Suns (doing something for the other department, dunno what).

    He then proceeded to tell us that it was the other computers, despite our telling him that we had isolated the NT boxes in question on their own switch and we still had the problem, but when we put a third computer on, both of the NT boxes could reach it just fine.

    We eventually lied to him, telling him that yes, we had unplugged all the unix machines, etc. (Like we're going to just unplug our company on the say-so of a moron, and like two junior techs would have the authority to do so anyway.) So now jim-bob starts to help, by telling us that Win2k is so much better, etc., that we wouldn't have these problems with it, etc.

    When we flat-out refuse to "upgrade" to fix this bug, his advice is that we format the drives and reinstall. ARGH!

    We finally convince him that these machines are somewhat important and we can't just wipe them every time there's a small problem.

    After over an hour with this jack-off, we hang up, problem unresolved.

    We get permission from the boss to call someone in... So we look through our list of contacts and grab someone whose card says they deal with networking and windows. Call him up. As we're describing the problem he listens quietly, grunts affirmatively when we describe how we isolated the problem, agrees that it couldn't be any of the other machines.

    Then he says, "It sounds like it's an issue with a bad route, type 'route .....'" We do, and then we reboot. Problem solved.

    He said that it, whatever it was, was a very common problem where the machines basically forget how to get from A to B. That command zeroed the routing (which didn't show any bad routes) and the reboot brought it back up.

    Cost, a 15-minute phone consultation. $45

    Microsoft tech support was basically a sales department, staffed with the marketing rejects.

    So, don't EVER believe it if someone tells you that MS supports their products. Any company whose line is "Format and reinstall" has no business calling a product "Server", let alone claiming they're in the enterprise level.

    Schon, earlier in this thread, said "Rebooting doesn't solve the problem!!" I wonder what he'd say about formatting and reinstalling.

  • by Flower ( 31351 ) on Wednesday June 27, 2001 @07:30AM (#125440) Homepage
    We have a right to know.

    No, we don't have a right to know. Ms. Tomlinson's departure is between her and her employer; not some tabloid expose for a bunch of overly curious rumor mongering conspiracy theorists. I wouldn't be surprised if the people who blurted this out on a public forum haven't been seriously bitch slapped by HR.

    As a community it would be best to let the matter drop. I'm sure if you were in Anne's position you'd be severely pissed. A little perspective and some empathy would be appropriate.

  • by schon ( 31600 ) on Wednesday June 27, 2001 @10:26AM (#125443)
    Who's to say that it's not a 1 PPM problem that won't affect the system again for another hour/day/month/year? Once the packets are flowing again, then you can relax and take the time to root cause the problem and fix it.

    And who's to say that the problem that's being experienced will be fixed by a reboot?

    We had a server running; one of the things it did was SMB sharing, and one of the drives (the one dedicated to non-critical SMB shares, in fact) died. This box was doing MUCH more than SMB - it was also our internal DHCP and DNS server.

    I was out, and one of our MS guys decided "I don't know what all these error messages mean, but I can't see my windows drives, so I'll just reboot it." Because the drive was dead, the machine wouldn't boot. He took the WHOLE DAMN DEPARTMENT OUT - nobody had DNS, and when people's windows machines stopped working, the solution was (guess what?) REBOOT them - so THEY stop talking to the network altogether.

    Now, the kicker is that the drives in this machine were hot pluggable. If the reboot hadn't happened, I could have swapped in a new drive, restored from last night's tape backup, and people could have continued working. Instead, because the machine was rebooted the whole department was down for several hours.

    The mantra stands - REBOOTING WILL NOT FIX THE PROBLEM. And if you reboot before you know what the problem is, then not only don't you know if it will help at all, but you also don't know if it will make the situation worse.

    sometimes getting back online as fast as possible is more important.

    That's the trap - there is no guarantee that rebooting will do this - and you might just be screwing it even worse.

    Getting back online as fast as possible involves solving the problem first - REBOOTING WILL NOT FIX THE PROBLEM.
  • by schon ( 31600 ) on Wednesday June 27, 2001 @07:52AM (#125444)
    it may have resolved the problem for a short while

    Even though you think you're saying the opposite of what I said, you've hit the nail squarely on the head - rebooting never fixes any problem.

    It may temporarily fix the symptom, but the problem is still there.

    It is possible for routers, Linux boxes, etc to crash.

    Yes, it is. But if they crash, it's for a reason - perhaps there is a bug in the configuration or firmware, or perhaps it's hardware. What's important is that rebooting will not actually fix the problem; all it will do is temporarily alleviate the symptom.

    If the problem is with the configuration, then you fix the configuration. If there is a bug in your software, you fix that. If it's hardware, you replace the faulty hardware. If it's firmware, you upgrade the firmware (or replace the unit with a different model, from a manufacturer who actually does quality testing.)

    But you do not just blindly reboot - if a reboot is required, you do it after you've discovered WHY the machine has crashed, and you've fixed it. Once again, the mantra is "Rebooting will not fix the problem."
  • by schon ( 31600 ) on Wednesday June 27, 2001 @06:02AM (#125445)
    I laughed out loud when I read this:

    But, says Yazz, "Since the Cisco was rebooted there were no logs to look at."

    You fell into the classic "Windows" trap. This is what I tell the Jr. tech guys here when one of the servers goes wonky: "If it doesn't work, there is a reason; something is wrong. Rebooting will not fix the problem."

    They usually respond with "but I didn't know what else to do."

    To which I answer "Repeat after me - REBOOTING WILL NOT FIX THE PROBLEM."

    "But I didn't know what else to do."

    "Then call someone who does - REBOOTING WILL NOT FIX THE PROBLEM."
  • by AtariDatacenter ( 31657 ) on Wednesday June 27, 2001 @06:55AM (#125446)
    Just wanted to say thank you for the explanation. After all, we are your customers! :) It is really nice to get an accounting of what happened.

    BTW: Are you going to plan any redundancy/failover drills as a result of this?
  • by Platinum Dragon ( 34829 ) on Wednesday June 27, 2001 @07:17AM (#125454) Journal
    If this is a "blow-by-blow" account, then could someone, I dunno, involved in the mess explain that little comment Taco made for about 20 minutes on Sunday about when the "qualified personnel" arrived, "[they] discovered that she wasn't actuually as qualified as we had hoped. Then she quit, thus terminating 3 local star systems."

    Was Rob just popping off at random, or was that little bit removed to cover /.'s ass in the face of a potential libel suit?

    Jes' wondering...
  • by mgoff ( 40215 ) on Wednesday June 27, 2001 @07:20AM (#125459)
    While that's technically correct, you have to look at the bigger picture. Rebooting may not fix the root cause of the problem, but it could very possibly get the system back online. Who's to say that it's not a 1 PPM problem that won't affect the system again for another hour/day/month/year? Once the packets are flowing again, then you can relax and take the time to root cause the problem and fix it.

    You can make a case that valuable troubleshooting info is lost when systems are rebooted. I agree, but counter that all good systems should have detailed event logging (ideally shipped off-box -- see the sketch below). Leaving the system online and intact is the best way to root cause a bug. But sometimes getting back online as fast as possible is more important.
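
    A cheap way to keep that event logging around after a reboot is to ship it off-box as it happens. Here's a minimal sketch of a syslog sink, assuming the devices are pointed at this host on UDP 514 (standard syslog port); the log file path is just a placeholder:

        import socket
        from datetime import datetime

        LISTEN_ADDR = ("0.0.0.0", 514)   # syslog over UDP; binding below 1024 needs root/admin
        LOGFILE = "netdevices.log"       # illustrative path

        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(LISTEN_ADDR)

        # Line-buffered append, so the log survives even if a sender reboots mid-stream.
        with open(LOGFILE, "a", buffering=1) as out:
            while True:
                data, (src, _port) = sock.recvfrom(4096)
                msg = data.decode("utf-8", errors="replace").rstrip()
                out.write(f"{datetime.now().isoformat()} {src} {msg}\n")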
  • by BluSkreen ( 47256 ) on Wednesday June 27, 2001 @11:43PM (#125466)
    ...to all of us that do this for a living. Forget for a moment that most here have never set foot in a real data center, much less even own a server. No pros want to see another's network go down (well, most of the time ;-) ), and we don't want ours down. I've spent many an hour looking at an errant PIX, or troubleshooting some other network config. I know what those guys were going through. It sucks...

    Don't slack. When you slack it bites you in the ass. Maybe not today, maybe not tomorrow, but someday, someday soon, it will.

    Test your failover configs. How? By actually making them fail. During the maintenance window, power that primary router/firewall/load balancer down hard and see if the failover works (a simple probe script, like the sketch at the end of this post, will tell you). It's like testing backups, kids. You have to know they work before you need them.

    Realistically develop on-call strategies. OSDN didn't really have a net ops staff of four. One had quit (why are they counted?), one was in the hospital, and two had weak "couldn't reach my cell phone" excuses. That just doesn't work in the real world. If you are on call, you are on call. The "phone too far away" and "battery fell out" excuses just don't cut it in the adult world of professional net ops. Get a satellite pager, and if you are on call, make sure it's on and near you so you can hear it.

    Don't bash your employees/former employees, particularly during a heated situation. Shows no class. Besides, if you are such hot shit, grab that console and fix it. Otherwise, keep your mouth shut. And besides, who is in charge of making sure the people that are hired are qualified? Hmmm?

    Document your shit. It's not that hard. Visio can do much of it for you. I'm going to break an NDA here, but the Exodus Service Agreement states that all machines and cables are to be labeled. That is so when the dude (or dudette) has to leave the NOC and enter your cage to reboot your lame box, they know what is going on. It also works well for when your net ops staff is too concerned with getting drunk or laid and your poor programmers have to go in to fix the network.

    Some folks really went above and beyond, but it seems to me that the management severely dropped the ball.

    Is VA really ready to abandon the hardware market for software services? One has to wonder.

    Dave
    been there before...
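
    P.S. To make the failover drill concrete: probe the load-balanced address once a second while you power off the primary, and see how long (if ever) the standby takes to answer. A rough sketch only, with a placeholder host and port rather than anyone's real VIP:

        import socket
        import time

        VIP = ("www.example.com", 80)   # placeholder service address, not a real VIP
        PROBE_TIMEOUT = 2.0             # seconds to wait for each connection attempt
        DRILL_SECONDS = 300             # how long to keep probing during the window

        down_since = None
        start = time.time()
        while time.time() - start < DRILL_SECONDS:
            t = time.time()
            try:
                with socket.create_connection(VIP, timeout=PROBE_TIMEOUT):
                    up = True
            except OSError:
                up = False
            if not up and down_since is None:
                down_since = t                    # outage begins
            elif up and down_since is not None:
                print(f"unreachable for {t - down_since:.1f}s during failover")
                down_since = None                 # standby finally answered
            time.sleep(max(0.0, 1.0 - (time.time() - t)))

        if down_since is not None:
            print("still down at the end of the drill -- the failover did NOT work")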
  • by anticypher ( 48312 ) <[moc.liamg] [ta] [rehpycitna]> on Wednesday June 27, 2001 @07:57AM (#125468) Homepage
    Yes, but have you tried dialing that number when Slashdot wasn't down?

    Rumour has it the conversation went a little something like this:
    [Kurt] Hi, cisco tech support?
    [TAC] Yes
    [Kurt] this is Kurt at slashdot...
    [TAC] Oh my god, it's about time you called us. You've been offline for nearly 24 hours; we're all going through withdrawal. Hang on a sec, our top techs are dying to help.

    I talked to a friend in cisco TAC (Brussels) who said that they regularly lurk on /., and in the TAC they could see it was a major network outage since the whole of the OSDN sites were unreachable. Nothing to do but wait, or answer calls from other customers :-)

    Since summer weather had come to Europe, I, personally, did not notice the outage. But I promise in the future to not have a life.

    the AC

    [Note to Kurt and company: make sure you return your customer satisfaction survey. Those TAC folks live and die based on keeping a very high level of sat scores. I think they need a 4.85 (on a scale of 1 to 5) just to keep their jobs within cisco, and a 4.89 to get a raise. So 5's across the board, and in the comments put a link to this /. story for their manager]
  • by NateTech ( 50881 ) on Wednesday June 27, 2001 @10:27AM (#125471)

    I've seen this scenario over and over again... one guy who knows and understands the network, ten people standing around at the equipment trying various silly commands to fix it when it's down...

    Here are some suggestions -- you probably already realize that 90% of your pain was avoidable, but everyone has to learn "the hard way" the first time, right?

    We've got piles of printouts and documetation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer...

    That's called bad documentation that no one ever reads.

    Get your networking guys to document TROUBLESHOOTING techniques and to teach the programmers how the network is actually set up and why. You have plenty of talent there capable of understanding how it all works.

    Get more than one way (cell phone) to reach your most important network engineers. Pop for a guaranteed delivery text pager and ask them to carry that as well as the cell phone.

    Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords."

    Paper. Wallet. Put them there. Better yet, a PGP-encrypted password escrow somewhere that anyone can get access to, and a cheap locked fire safe at the office with the public and private PGP keys on a CD-R inside -- for just this type of scenario. (A sketch of the encryption step is at the end of this post.)

    So I asked the Cisco technician, Scott, to telnet into our switch...

    Bad bad bad... telnet = bad. Good network security always goes out the window when the network's down...

    So he's in the switch and he's disgusted and horrified by how we have it configured...

    This is probably the most important hint from your entire outage... your network people either don't know what they're doing, or you're not ALLOWING them to do their jobs, or they're understaffed, or whatever other excuse can be made up... your call, but don't forget this -- if Cisco is "horrified" by your configs, there's a serious issue you need to find and correct somewhere in your organization. Everything from training to documentation to troubleshooting procedures needs a serious walk-through.

    The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...

    DO FAIL-OVER TESTING. If you'd done a fail-over test of this config, you'd have known it didn't work correctly during a nice scheduled time when your network engineers were available and at the equipment, instead of in the middle of the night during an outage with all of them MIA. This is so easy to avoid.

    Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. Thats what the Cisco folks said basically, and the Exodus people are no longer supporting this type of Cisco in their setups.

    Nice of them to tell you. Who is the customer here again?

    ...he's talking to me on my cell phone (which disconnects every 5 minutes just to make my day more challenging)

    Put a $20/month POTS line in your cabinet for goodness sake!

    That's enough... I'm appalled, but hopefully you will straighten out some things now that the site was down for an extended period. Done properly, network downtime should be a rare event, usually caused by human error, not by bad configuration.

    Many outages are unavoidable; yours sounds like it was avoidable, and certain steps could have been taken to minimize its length.
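
    On the password escrow point: the mechanics are simple enough to script. This is a sketch only -- it assumes GnuPG is installed and an escrow key pair (called "netops-escrow" here, purely illustrative) already exists, and the file names are made up; the private key stays offline in the safe:

        import subprocess

        PLAINTEXT = "passwords.txt"       # enable passwords, console logins -- never leave this lying around
        CIPHERTEXT = "passwords.txt.asc"
        ESCROW_KEY = "netops-escrow"      # hypothetical key ID of the shared escrow key

        # Encrypt to the escrow key; the armored output is safe to stash anywhere
        # reachable during an outage (ops wiki, home directories, the cage binder).
        subprocess.run(
            ["gpg", "--armor", "--recipient", ESCROW_KEY,
             "--output", CIPHERTEXT, "--encrypt", PLAINTEXT],
            check=True,
        )

        # During an outage, whoever fetches the private key from the fire safe runs:
        #   gpg --output passwords.txt --decrypt passwords.txt.asc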

  • by Tackhead ( 54550 ) on Wednesday June 27, 2001 @08:12AM (#125473)
    > Really, what Linux (and other geek subjects) need is to have a Great Book of Failure Stories -- writeups like these that detail horrible outages, downtimes, misconfigurations, security hacks, etc., so that we all can learn from other's mistakes.

    What you said.

    I did a bit of (very junior-level) sysadminning back in my day.

    First thing the BOFH told me was "Buy a hard-cover notebook. Not spiral-bound. Not softcover. Write down everything you do. Feel free to doodle and write obscenities if you like. Someday you'll thank me for this".

    I was a bit befuddled, and then he showed me his notebooks. Five years of dramatic fuckups and even more dramatic recoveries. His own personal "deja.google.com" (but it was 1992, and long-term USENET searching hadn't been invented yet -- hell, our office was using UUCP!) for everything he'd had to work out from first principles on his own.

    And thus was the PFY enlightened.

    (And yes, I did buy him a beer in late 1992, when something I wrote down in mid-1992 jumped off my page and saved my ass.)

  • by mjh ( 57755 ) <(moc.nalcnroh) (ta) (kram)> on Wednesday June 27, 2001 @05:47AM (#125480) Homepage Journal
    So I called Cisco tech support. I wish had done this sooner. I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician.

    While I agree that I usually get someone at cisco who knows what they're talking about, it is very rare in my experience that it happens in only a minute, although it does occasionally happen. A much more common experience is to wait on hold for 15-20 minutes, but I have waited on hold as long as an hour with them.

    All of that being said, I would have to agree that cisco's TAC is probably one of the best tech support groups I've ever worked with.
    --

  • by montey ( 60028 ) on Wednesday June 27, 2001 @06:44AM (#125482) Homepage
    Someone kind of alluded to this already, but MY GOD are your security procedures busted!

    Point 1./ Why do you allow TELNET into your routing/switching equipment from the outside world? If a Cisco tech with the password can do it, then a hacker without the password likely can too.

    Point 2./ If you are connected to the Internet in any way, NEVER replace your firewall with a crossover cable. Basically, at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
  • This is exactly what we need on the 'net for us sysadmins to read. Failure stories. Why? You don't learn much from success stories, because things worked the first time.

    "Welcome to the HOWTO. My setup worked the first time. Why didn't yours?"

    Granted, no one wants to see stuff on the 'net go down (and we're glad you're back, /.) But writeups like this one and Steve Gibson's at GRC about the DDoS attacks are priceless. They show what people have tried, what hasn't worked, what did work, and definitely where to start the next time.

    Really, what Linux (and other geek subjects) need is to have a Great Book of Failure Stories -- writeups like these that detail horrible outages, downtimes, misconfigurations, security hacks, etc., so that we all can learn from other's mistakes.
  • by DoomHaven ( 70347 ) <DoomHaven@hotmail.cCOMMAom minus punct> on Wednesday June 27, 2001 @05:13PM (#125492)
    I hereby propose the term "anne-tomlinson", or "tomlinson" to describe the act of departing a company in the most suspicious of circumstances, known only to a very privileged few. Used in the following example:

    X: "What happened to Anne?"
    Y: "I don't know; all I know is that she anne-tomlinnsoned from work."

    Note that this verb should have the subject of the remark used as the subject of the verb, and the organization left as the indirect object. This should be adhered to regardless of whether the subject quit, was fired, was laid off, died, disappeared, never existed, or there was a mutual decision for the subject to leave. In fact, the verb should mainly be used when the method of departure is unknown or never officially stated (or even officially acknowledged).

    Also note that this verb should NOT refer to a person leaving another person, as in "Fred's now-ex-wife had tomlinsonned from him." The number of people (one or more) that make up the subject should be less than the number of people whom the object represents.

    Continuing on, this verb should NEVER be applied in a self-referential manner, i.e.: "I anne-tomlinsonned from them". This implies that the subject either A) knows the reasons, and is just being a prick about not stating them, or B) does not know the reasons due to massive thick-headedness.

    Lastly, this term should only be used to convey the sense of impenetrable mystery surrounding the departure. It would be oxymoronic to state: "Ted tomlinsonned because he was bored and wanted to leave." If the mystery surrounding the departure is penetrable, use another phrase.

    anne-tomlinson, v.: to leave or be removed from a group under extremely odd and mysterious circumstances, especially when the actual method of departure or the initiating party is unknown; more especially, when the actual departure is apparently covered up or left unacknowledged.
    tenses: anne-tomlinson, anne-tomlinsons, anne-tomlinsonning, anne-tomlinsonned, had anne-tomlinsonned.
  • by scoove ( 71173 ) on Wednesday June 27, 2001 @05:55AM (#125493)
    Well, we have a $LITTLE_NUMBER support contract with Cisco, and have had similar with two previous companies.

    Our results were much the same. Very, very responsive people.

    I have to agree with Taco, if they gave this kind of service down at the DMV, they'd be picking up passed out folks left and right.

    *scoove*
  • by Greyfox ( 87712 ) on Wednesday June 27, 2001 @07:10AM (#125509) Homepage Journal
    I mean, you're never supposed to get competent support on the 800 tech support line! The guy you talked to is probably due to be transferred out to some other department because he's obviously way too smart to be anywhere near a phone.

    Wish the companies I deal with on a regular basis ever showed that level of skill when I need help. Well... hmm... actually, Speakeasy is generally pretty good about accepting that my problem is accurately diagnosed and figuring out what's wrong. And Viewsonic the other day was able to provide refresh-rate specs on a monitor I wanted to order within about 60 seconds of my placing the call (though they dropped the ball by not having the specs I wanted available on their web page). What is this trend of good service? It's scaring me...

  • by inburito ( 89603 ) on Wednesday June 27, 2001 @01:11PM (#125513)
    You made your point with capitalized letters, but still... If everything has failed and you have nothing to lose, why not give it a shot? This is exactly the situation these people were in! It might not work, and you've lost what, a couple of minutes. But it might work, and your day is half saved.

    Just because someone screwed up at your work doesn't make your mantra a universal rule, especially when dealing with something like a router or a switch. These things are normally not meant to be user-serviceable and will take a reboot just fine (no hot-swappable drives there). You could have hit a 1 ppm problem, and rebooting just brings everything back online until statistics kick in again. A little uptime is better than none.

    Sure, it won't fix anything per se, but getting things normalized lets you start concentrating on the problem at a less hectic pace.

  • by cheezus ( 95036 ) on Wednesday June 27, 2001 @10:39AM (#125516) Homepage
    Rob, on behalf of all the script kiddies out there, I would like to thank you for disclosing exactly how OSDN's network operations are set up :)

    ---

  • I'm not sure I understand that. Why does the router purge its logs when you reboot it?

    That sounds lame as hell. (Granted, though, configuring a Pipeline 50 goes right over my little bow head, much less a Cisco. So yes, I'll stipulate that I'm talking out of my ass here.)

    The act of rebooting should be just another event that gets logged, NOT a synonym for "oh, and by the way, you can delete the old log file now."

    IMHO log deletion should be done on a calendar basis; everything more than x days old gets purged automatically. What's Cisco's rationale for auto-deleting logs during the boot process?
  • by Dr_Cheeks ( 110261 ) on Wednesday June 27, 2001 @06:19AM (#125536) Homepage Journal
    Yeah, they made such a big deal out of how rubbish she was (and later re-worded things to sound less misogynistic), basically pointing the finger at the support people, but this write-up is all glowing adverts for Cisco and doesn't mention anything like that.

    C'mon, tell us the full story!

  • by Dr_Cheeks ( 110261 ) on Wednesday June 27, 2001 @06:50AM (#125537) Homepage Journal
    Hey, try browsing at -1 nested - seems like everyone who's questioned the story about the woman who "quit" (Anne Tomlinson?) has suddenly been modded down. I'm not usually one for conspiracy theories, but is this surprising anyone else? I think that the question couldn't really be more on topic, and they're hardly flamebaits or trolls - what's up?

    BTW, feel free to mod me down, prove my point and compound my paranoia; I've got karma to spare : )

  • by petard ( 117521 ) on Wednesday June 27, 2001 @07:49AM (#125552) Homepage
    We're a very small installation and get similar response. If you've paid for a support contract and it even smells like a router problem, the fastest way to fix it is to call them right away. They are a model tech support organization.
  • The only thing worse than having no backups/redundancy is having backups/redundancy that you think will work, but, in fact, don't.
  • by Lord Omlette ( 124579 ) on Wednesday June 27, 2001 @08:40AM (#125564) Homepage
    I will now prove, using extremely shaky methods, that "Blow-by-Blow Account of the OSDN Outage" by Roblimo is, in fact, an epic myth.

    I. Call to Adventure

    "By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem. The network operations people were paged, but did not respond."

    II. Meeting the Mentor

    CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."

    Whoops, that's not it.

    "So I called Cisco tech support."

    There we go.

    III. Obstacles

    "Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documetation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you."

    IV. Fulfilling The Quest

    "He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird."

    V. Return of the Hero

    "The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers."
    "Tuesday was router reconfig day."

    VI. Transformation of the Hero

    "At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be."
    "We certainly aren't going to make the same ones [ed: mistakes] again!"

    Peace,
    Amit
    ICQ 77863057
  • by Drone-X ( 148724 ) on Wednesday June 27, 2001 @06:22AM (#125576)
    So I called Cisco tech support. I wish had done this sooner. I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician... doesn't Cisco realize how shocking this is to technical people, to actually be able to talk to qulified technicians immediately who say things other than, "Well, it works on my computer here..."? Do they not know that tech support phone numbers are supposed to be 900 numbers that require you to enter your personal information and product license number, then forward you to unthinking robots who put you on hold for hours, then drop your call to the Los Angeles Bus Authority switchboard... does Cisco not understand that if you do not put people on hold for at least 10 minutes they might pass out in shock for being able to talk to a human too soon? Apparently not.
    Yes, but have you tried dialing that number when Slashdot wasn't down?
  • by CaptainZapp ( 182233 ) on Thursday June 28, 2001 @12:40AM (#125599) Homepage
    As much as I agree with you, even to the point that there is ultimately an event triggering nasty things, a reboot can (granted, very rarely) solve the problem.

    Take a relational database, for example; there is so much that can go wrong with it. For starters, there are bugs in such complex products, and fixing them (save for PostgreSQL) is beyond your control.

    It need not even be a bug in the database code. It can be something in your network components (we chased cases for months that turned out to be a DECnet issue but were attributed to the database server), or it could be the fact that the db vendor compiles his product on multiple platforms and it's virtually impossible to test every function of a new release on every supported platform. Yes, I know that in an ideal world this should be done, but it isn't.

    Assume it were possible to perform such tests. Save for proprietary (or semi-proprietary) architectures like OpenVMS/AXP, you can have so many different hardware and network components that it's just not possible to foresee all eventualities.

    Even after ruling out such possibilities, we're not there yet: What are the query characteristics? How many concurrent users do what, and when? What front ends do they use, and how are they connected? The problem may even be caused by a component that has nothing to do with the database engine (Access front end, anyone?)

    Although the fundamental cause of the problem might never be detected, a reboot of the database server might fix it, and it may never occur again, since the same combination of factors occurs so rarely that it's all but impossible to reproduce the problem.

    However, the [alt-ctrl-del] attitude of younger IT folks (specifically those that grew up in a PC environment) makes me barf and indicates just how clueless a lot of those folks are. You never reboot a production IT component unless there is no other choice, or in the context of your normal maintenance cycle (memory leaks do occur in software).

  • by sulli ( 195030 ) on Wednesday June 27, 2001 @09:20AM (#125607) Journal
    Flower is right.

    Either Anne is real or she isn't. If she's real, this is an internal matter that we really don't need to interfere in. If, as the "Anne" poster suggested, she quit because Taco and Hemos are hard to work with, she was within her rights and should get at least some support from a community which often says "Quit! Now!" to Ask Slashdots about PHBs.

    If she's not, this is all a big waste of everyone's time, and possibly the best troll we have ever seen on slashdot. (An account by that name [slashdot.org] has a brand new uid (462836) and zero comments.) Think of the trolls you've posted - how many led to 100s of posts on other threads, conspiracy theories galore, and posts by #1 and #2? Whoever did this (if not Anne) should get mad props from the troll fans, but should not take any more of our time.

    My bet is that she's not real. But in either case we should drop it and get on to more important things.

  • by sdo1 ( 213835 ) on Wednesday June 27, 2001 @08:40AM (#125620) Journal
    OK, so the config was a mess. But it was like that BEFORE the outage, right? So what happened between "running OK" and "we're down" to cause it to fail? I didn't see anything to explain that in the report. Or maybe they don't know...

    -S
  • by sdo1 ( 213835 ) on Wednesday June 27, 2001 @05:52AM (#125621) Journal
    Just don't name the router "Kenny".... he dies every week.

    -S
  • Are we really to believe that nobody was available for 48 hours?

    Everything failed at about 7 AM Sat. Dave was at Exodus between 8:30 and 10:30 AM Sat (I didn't look at the log book when I got there). Kurt arrived shortly after that, I believe (again, I didn't look at the log book). I arrived there around 11:40 AM Sat.

    And yes, my battery was loose on my Nextel. It just takes a little upward pressure to loosen the battery on the i1000plus I have. The battery doesn't fall out, it just comes loose enough that it loses contact and turns the phone off.

    I have now taped the battery in place!

    Yazz Atlas

  • by Leon da Costa ( 225027 ) on Wednesday June 27, 2001 @06:06AM (#125634)
    > (we have 4 T-1's...some ask why don't you go with a OC-3? 4 T-1's are probably cheaper and provide redundancy...nuff said).

    ehr... 4 T1's = 4 x 1.544 Mbit (= 6.176 Mbit... der)
    1 OC-3 = 155.52 Mbit.

    Not really similar, eh :-)
  • by American AC in Paris ( 230456 ) on Wednesday June 27, 2001 @06:50AM (#125644) Homepage
    All right.

    After having been modded down next to the goatse links, somebody please explain to me how the hell we're supposed to discuss the decidedly strange disappearance (and subsequent reappearance) of this story on the site without getting modded as "offtopic"?

    Just where, exactly, are we to discuss this little point? For example, why did this story disappear? Was it technical? Was it editorial?

    For a group that is so damned keen on openness and truth, it strikes me as somewhat ironic that several dozen mod points have been used to effectively suppress this part of the thread.

    I want to know what happened. Others do too. If you can't give us a decent place on Slashdot to discuss this issue, then don't mod us down as offtopic!

  • by American AC in Paris ( 230456 ) on Wednesday June 27, 2001 @09:19AM (#125645) Homepage
    Thank you, Hemos. I was kinda guessing and hoping that it was something technical.

    You'll understand my consternation, though, upon seeing my (admittedly offtopic) post on the Shared Source article regarding the disappearance of this article modded down three points to -1 in the course of roughly one minute, and the seemingly similar fate of a good many other posts like it. Also worth noting is the fact that this article touches on what many would consider a rather sensitive issue with the OSDN and /. crew right now. I don't like conspiracy theories much, but I'll be damned if the situation didn't seem, well, rather odd.

    Might I suggest, though, that once stories are actually posted to the front page, they remain as is, even if the order of presentation is not the most desirable? Consistency is key, and having articles disappearing from the front page is not terribly consistent.

  • by Bonker ( 243350 ) on Wednesday June 27, 2001 @05:54AM (#125650)
    Security through obscurity is no security.

    No matter how FUBAR'd your router/switch/firewall configuration is, it's still no serious obstacle to crackers, Robin.
  • by plcurechax ( 247883 ) on Wednesday June 27, 2001 @07:38AM (#125652) Homepage
    If you haven't heard of it before, get your browser over to RISKS digest [ncl.ac.uk] or comp.risks [comp.risks]. Forum On Risks To The Public In Computers And Related Systems

    I discovered the RISKS digest when reading about software engineering, and it has certainly helped me think about failures and recovery when designing and building systems.

    There is also an underlying thesis in the article about how complexity -- whether in a bungled redundant network connection or just a large, poorly documented, poorly tested, and poorly configured system -- is your enemy in building reliable systems. A lot of systems were built like Slashdot during the dot-com IPO craze; I wonder how many of us rely on such poorly built systems?

    Building complex, reliable networks is hard, and expensive. About 3 times what your estimate is, which is about 2 times what your boss expects to spend.

  • by rixster ( 249481 ) on Wednesday June 27, 2001 @07:28AM (#125653) Journal
    if (comment like "%girl%" or comment like "%What happened to%" or comment like "%original story%") update posting set score = -1, reason = random("Troll","Offtopic")
    Is there someone else out there hosting a site where we can have a non-biased discussion?
    BROWSE AT -1. Check out how many posts have gone straight in at -1 (and this one will too, I betcha...)
  • by Miss Tress Race ( 309097 ) on Wednesday June 27, 2001 @07:13AM (#125673)

    Hey, try browsing at -1 nested - seems like everyone who's questioned the story about the woman who "quit" (Anne Tomlinson?) has suddenly been modded down. I'm not usually one for conspiracy theories, but is this surprising anyone else? I think that the question couldn't really be more on topic, and they're hardly flamebaits or trolls - what's up?

    Yes, I'd like to know too. I didn't see the original "Slashdot Back Online" announcement, but I did see Maswan's listing of the 3 different versions of it [slashdot.org], with all reference to the "she wasn't actually qualified" girl removed. And of course, the message from Anne Tomlinson [slashdot.org] decrying her treatment at the hands of Jeff and Rob. And an astonishingly rare post from CmdrTaco [slashdot.org] (his only post in the last several weeks) dismissing her as a troll (and being of course sycophantically modded up to 5).

    So I, and no doubt many other loyal Slashdot readers, would like to know - what really happened? Who is Anne Tomlinson? Why did she quit? Why has all reference to her been purged from the site? Why is everyone who asks about her being modded down to -1 so quickly that it is obviously editors doing it?

    We have a right to know.

  • by aldjiblah ( 312163 ) on Wednesday June 27, 2001 @06:18AM (#125676)
    Quoted from the original Slashdot Back Online [slashdot.org] article (before it was modified): "And when our qualified personel arrived, we discovered that she wasn't actuually as qualified as we had hoped. Then she quit, thus terminating 3 local star systems."

    Where does this mysterious woman fit into the story above?

  • by sehryan ( 412731 ) on Wednesday June 27, 2001 @05:54AM (#125685)
    what happened to the woman that quit?
    -
    sean
  • by vwpau227 ( 462957 ) on Wednesday June 27, 2001 @06:05AM (#125705) Homepage

    I remember when I started out in computer networking (and it didn't seem like that long ago), I was told something by one of the other technical members of our team that I haven't forgotten: redundancy in a system is necessary not only in the hardware and software of that system, but also in the resources used to keep that system running (that includes, of course, human resources, as well as power, HVAC, and so on).

    Too often, the human part of the redundancy equation isn't fully factored in. When you don't put all of the human factors into the redundancy equation, you have a redundant system that isn't really redundant.

    Of course, it helps if you have a vendor that will work with you (and those of you who remember working with Novell servers in "the old days" know what I'm talking about, too).
