Blow-by-Blow Account of the OSDN Outage
Our network operations staff was shorthanded; one of our most knowledgeable people had quit recently to go into business with a friend and had not yet been replaced. Another was in the hospital, ill and unreachable. A third's cell phone was on the kitchen counter, unhearable from the bedroom, and the fourth one's cell phone battery had fallen out. It was a frustrating comedy of errors, and an unusual one. Our netops staff is typically "on the bounce" 24/7.
Dave Olszewski, an OSDN programmer who is not technically part of our netops staff and is not trained in our equipment setup, happened to be on IRC at the time. He doesn't live far from the Exodus facility in Waltham, MA, where our server cage lives, so he went there immediately. Kurt Gray, lead programmer, whom we dragged out of bed, was not far behind. Hemos and others were awake by then, growing frantic as we found that not only Slashdot, but also NewsForge, freshmeat, OSDN.com, ThinkGeek, and QuestionExchange were down, along with our old -- but still popular -- MediaBuilder and AnimationFactory sites. Arrgh!
This is Kurt's "on the scene" report from Exodus:
Walk into our cage at Exodus and it seems harmless enough, but try to learn what everything is doing and where the wires are all going in less than an hour and you could go insane. You're standing in a nice, clean, uncomfortably air-conditioned facility with 150 of VA's FullOn and various other servers humming away. Greeting you at the door is "Big Gay Al," our Cisco 6509, which contains two redundant router modules: Kyle and Stan. If Stan dies, Kyle takes over and vice versa. Across the cage are two Arrowpoint CS800 load balancing switches: one is racked and idle (as a hot spare) and the other is live and balancing the load for most of our OSDN web sites. Between the Cisco 6509 and the Arrowpoint is a bridging FreeBSD firewall using ipfw rules to block stuff like ping, just to drive everyone nuts basically.

"I can't ping your site!"
"Yeah, we know."
Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documentation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you.
At this point if you know anything about networking you'll demand an explanation for why we're using each piece of equipment in the cage and not a WhizBang 9000 SuperRouter like the one you've been using flawlessly that even washes your dishes for you and makes food taste better too... I can only tell you that I'm not the networking design person here, I didn't choose this equipment or configure it, but I'm told it's very good hardware as long as you know what you're doing, but as CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."
So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.
"Did you reboot the firewall?" I asked Dave.
"I rebooted everything," he said. "I think's it's the Cisco."
So we console into the Cisco 6509. What a mess. Neither of us understands how this switch was configured or what it is trying to do. We don't fully understand why you can get a console connection on Stan but not Kyle (it turns out the standby module doesn't have an active console; that's normal).

Headshaking all around. Meanwhile, about 11:40 a.m., Yazz Atlas woke up and got his cell phone reunited with its battery. He picked up his voice mail messages, tossed on clothes, and hustled over to Exodus.
Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords." The op who was supposed to be on duty (the one whose phone was out of hearing) was still nowhere to be found. They called their hospitalized coworker and got the Cisco passwords.
But, says Yazz, "Since the Cisco was rebooted there were no logs to look at. We could ping something on the inside but not everything. On some VLANs we could ping the gateway and others not. The outside world could ping one of the IPs the 6509 handles but not the other. From the inside we could not ping the IP that the outside world could ping. We could ping the one that they couldn't...very frustrating..."
Kurt again:
Several hours of this sort of network debugging went on until 3:00 a.m. Sunday. By then we had called Cisco for help. They couldn't help us until they saw the switch config and got a chance to review it. We were spent. We had to go to bed and stay down for the night.

Next morning we're back at Exodus and the situation hasn't changed -- our network is unreachable to the outside world. I was hoping that during the wee hours of the morning the Cisco 6509 had become sentient and fixed its own configuration, or perhaps a friendly hacker had cracked into it and fixed it for us, or perhaps ball lightning would travel down a drain spout and shock our cage back to life like those heart paddles paramedics use... "It's a miracle!" No such luck.
So I called Cisco tech support. I wish I had done this sooner. I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician... doesn't Cisco realize how shocking this is to technical people, to actually be able to talk to qualified technicians immediately who say things other than, "Well, it works on my computer here..."? Do they not know that tech support phone numbers are supposed to be 900 numbers that require you to enter your personal information and product license number, then forward you to unthinking robots who put you on hold for hours, then drop your call to the Los Angeles Bus Authority switchboard... does Cisco not understand that if you do not put people on hold for at least 10 minutes they might pass out in shock for being able to talk to a human too soon? Apparently not.
So I asked the Cisco technician, Scott, to telnet into our switch and take a look at the config. I figured he'd balk and say, "No I can't do that," because of course this is a tech support number I called so he's going to tell me to give the phone to my mommy if she's there and ask her to log into the switch because, since I don't have a lot of experience with IOS, I must be some kind of idiot to even call tech support without knowing what my HSRP configuration is on VLAN 4. Instead he says, "OK, what's the login password?" I can't believe this... I must have dialed the wrong number, he's not going to just go into our switch and sort this out for me right here and now, is he?
So he's in the switch and he's disgusted and horrified by how we have it configured, and I'm sure he's right. So I ask him, "Well, can you change all that?" I figure he'd say, "No, this is your equipment, you fix it yourself," but he doesn't, he says, "Sure, what's the config password?" You gotta be kidding me, I must have dialed the wrong number here... this cannot be a tech support line... you can't actually get a tech support rep on a toll-free number simply to log in and fix your router setup while you whine at him on the phone... this is not real.
So he's in the switch config and he's having a great time pointing out everything some of our people warned us about months ago. He tells me this is wrong, we shouldn't be doing this or that... "Well, then change it if you don't mind," I tell him. "Switch broke. Me dumb. You fix." ...so at one moment Scott wanted to undo some changes. He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird.
Ok, that's all fine, but Scott is still freaked out about how we have the switch configured. Soon I get a call from Barnaby, another hot shot Cisco tech rep. He just logged into our switch and he's horrified too. He wants to walk me through a total switch upgrade and cleanup right now. "Not tonight," I tell him. "I'm burnt and I need to consult some network people over here before we mess with this any further."
The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers. Instead of getting an answer from Exodus and running to Cisco with it, and then back again, he got Cisco and Exodus engineers to talk directly to each other and work it out. He conferenced an Exodus network engineer to Barnaby at Cisco and, Kurt says, "they talked alien code about VLANs, standby IPs, HSRP, multihoming, etc. etc., and they came to an agreement: our switch config was a mess... but at least Barnaby knew what the settings were supposed to be and an Exodus engineer agreed with him."

Before moving on to the (short) Tuesday outage, here are a few more notes from Yazz:
The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured... Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. That's what the Cisco folks said, basically, and the Exodus people are no longer supporting this type of Cisco in their setups. Half the VLANs were only stored on one unit and the other half on the other. So when one died it only knew half of the full setup and couldn't route things correctly, since the VLANs it wanted weren't there... Fun!!!

Tuesday was router reconfig day. It was originally only supposed to cause "about five minutes" of downtime, so it didn't seem worth posting any kind of notice that it was going to happen. Why the middle of the day instead of a low-traffic post-midnight time? Because this way, if there was any trouble, lots of people at Exodus and Cisco would be awake and around to help. And it was a good thing this choice was made. Kurt picks up the story:
Tuesday 11:00 a.m. we're back in the cage. Barnaby is logged into our switch while he's talking to me on my cell phone (which disconnects every 5 minutes just to make my day more challenging), helping us by upgrading the Cisco 6509 firmware; then he's going to clean up the config. First step was getting the firmware patches onto a TFTP server near the switch (it had to be less than 3 hops from the switch; TFTP doesn't work over longer hops). Yazz took care of that. From there Barnaby patched the firmware, had me reboot the switch, and we should be down for just 5 minutes. Unfortunately 5 minutes turned into 2 hours.

After the switch reboot, part of our network was unreachable again, much like Saturday's episode, only this time with a Cisco rep on the phone helping us work it out. Again we started tracing cables all over the cage, pinging every corner of the matrix. Barnaby got an Arrowpoint tech rep, Jim, on the line and into our Arrowpoint. But this is tech support, Jim isn't just going to log into our Arrowpoint and debug it for us, right? Wrong, this is Cisco tech support: Jim logs into our Arrowpoint and works with Barnaby to trace packets and debug our network.
For a while we put a cross-over cable in place of the firewall just to be sure the firewall box wasn't jamming us. Nope. Didn't help. Barnaby and Jim are mapping hardware addresses to IP addresses to figure out where each packet is going. Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light on its connection to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"
"Hold on," he says over the phone to me. Then the light goes green, and after a few seconds of routers correcting their spantrees we're back online. Everything is back online. All this time it was this little interface to an ignored switch that none of us bothered to account for. Make a big note about in the network documentation, please.
After we came back online Barnaby went ahead and cleaned up our switch configuration, put things the way they ought to be, and made our connections sane and stable.

This has not been OSDN's finest week. But we thought it was better to give you the full rundown than try to pretend we're perfect. At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be. If nothing else, perhaps this story can help others avoid some of the mistakes we made. We certainly aren't going to make the same ones again!
Cisco couldn't buy advertising like this... (Score:2)
Well deserved though.
Re:Beware of departure from original statement (Score:2)
Re:Cisco Support (Score:5)
Many people who call don't understand how the system works internally so here's a summary: We have cases in 4 groups, priorities 1 through 4, 1 being the most important. The designation of the priority of the case is entirely up to you as a customer. All cases are P3s by default which more or less means they need resolution within 72 hours. If your network is down and you need help right now, today with no waiting we'll elevate to a P2. If you are in a serious network situation like the one described in the article then it's a P1 and literally everything else stops, a bell goes off and everyone crowds around the tech w/ the problem (unless it's a softball case).
There are TACs all over the world but for English-speaking customers what usually happens is the US TACs roll over to the Australian TACs in the early evening who in turn roll over to Belgium and then back to the US. P1s get worked 24 hours until they're resolved, and if they're not fixed in less than 4 hours it's not so good for us.
We have to close about 5 of these cases a day which is sometimes cake (I can't ping my interface which is shut down) and sometimes nasty (redistribution 12 times over).
Also, those little surveys you get everytime you work with us (Bingos) are very important. If you'll recall you can rate us from 1 through 5 in 8 to 10 different categories. Anyone who doesn't maintain an average of at least 4.59 is not long for the TAC, 2 or 3 months tops.
The pay is actually kind of crap but there's no better place in the world to prep for your CCIE. I don't think anyone views the TAC as a long-term environment. Too much stress honestly.
Re:Grr... (Score:5)
Not everything is a conspiracy folks.
Re:cisco tech support are badasses (Score:4)
- Robin
Re:You know you've been using windows too long whe (Score:2)
Re:You know you've been using windows too long whe (Score:2)
Re:More Writeups Needed (Score:2)
Rob, why would you throw Slashdot into this mess? (Score:2)
You put our favorite news engine in the middle of a routing mess that the network engineers had been warning you about for months?
What were you thinking?
You must be able to find a nice, comfortable colocation site somewhere.
Re:Eeep - scary moderators! (Score:2)
..and then she was erased from the latest "official" version of the story. What the fuck is this? This isn't a "blow-by-blow account", it's a service pack to fix the "bugs" in your last account of what happened!
Go on! Mod me down to -1 again! You'll have to do it a few times before I go below the "post-at-2" threshold!!
Re:Eeep - scary moderators! (Score:5)
Within a minute ??? (Score:5)
Was anyone else waiting for the "*clickity-click* Wow, it looks like your entire root directory was deleted!" punchline? :-)
Re:Cisco Support (Score:2)
Oracle are a big company, and vary hugely in the support they give you. I've had situations where I've been given the runaround, like you. Getting passed from extension to extension, explaining my problem over and over again, "oh, umm, we don't do that stuff here, call this number..." and finding out that Bob's on holiday and his secretary has no idea who else I could speak to...
I've also had situations where Oracle have said our engineers aren't sleeping until this gets fixed, and a few hours later there's a motorcycle courier at my door with a gold disc containing a brand new build of Oracle with the bug fixed. I've had Oracle techs ssh into my servers, I've had them come to the data centre with mysterious CDs containing Oracle software that they don't let outsiders have, and that they erase from your machine once they're done.
Helps to have (or at least have access to) a high-end support contract, tho'. If you're some kid who downloaded 9i onto his Red Hat box, forget it.
Re:I just have a couple of questions (Score:2)
Uh... TFTP uses UDP, which is a connectionless protocol. You can of course transfer files over more hops, but keep in mind, the more routers, etc. you have in the middle, the more chance of a packet being dropped, and one packet can mean quite a bit when you're transferring a new IOS image to your Cisco ;)
Now it's been quite some time since I've looked at the TFTP RFC but I'm pretty damn sure it has the capability to request a block be retransmitted in the case of a timeout (packet loss). In fact, I'm sure of it; during the upgrade a few '.'s were noticed amongst a ton of '!'s and the checksum still worked out.
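For readers who have never done the upgrade being described, it works roughly like this from the IOS prompt; the host address and image filename here are placeholders, not OSDN's actual values:

    switch# copy tftp flash
    Address or name of remote host []? 192.0.2.10
    Source filename []? ios-image.bin
    Destination filename [ios-image.bin]?
    Loading ios-image.bin from 192.0.2.10: !!!!!!!!!!.!!!!!!!!!!

Each '!' is an acknowledged block; a '.' is a block that timed out and was retransmitted, which is exactly the mix of dots and bangs mentioned above -- the transfer still completes and the checksum still verifies.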
I forgot to add one thing (Score:3)
Was this configuration ever tested?! It sounds like it was put together, prayed over and sent out into the world.
It would have been simple to test, too... pull out one of the uplinks... then the other... now try pulling out some of the webservers... and so on.
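A minimal sketch of such a drill on an HSRP pair like Kyle and Stan, assuming a made-up uplink interface and console access to both modules:

    show standby brief
    configure terminal
     interface GigabitEthernet1/1
     shutdown
     end
    show standby brief

The first "show standby brief" tells you which side is Active; the "shutdown" deliberately kills the primary; the second should show the former Standby as Active, and the virtual IP should still answer from outside the cage. A "no shutdown" puts things back. If that second check fails during a scheduled window, you've just found Saturday's outage for free.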
I just have a couple of questions (Score:5)
By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem.
Reboot the database?? WTF? You just proved my point as to why MySQL is NOT ready for primetime. Reboot the fscking database??
So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.
Guys, this isn't Windows -- rebooting is an absolute last resort, and if it works then you have discovered a problem, either in hardware or software, and it needs to be fixed, not just an "oh well, a reboot fixed it, life goes on." Bastions of professionalism you're not.
I don't normally flame people for this kind of thing but the Slashdot crew are especially keen on bashing Windows, yet you resort to their exact tactics whenever a problem comes up.
Reboot the database?? I still can't believe I read that. Sorry.
Cisco Systems have some wonderful systems -- Hell I just recently found out about their stack trace analyzer... feed it a "sh stack" and it emails you back a list of IOS and/or hardware bugs which likely caused the crash. That is just plain old SCHWEEEET. Or being able to read their memory mappings to find out what is causing a bus crash... Ideal. You don't just randomly reboot the damn shit to try and get it to work. If it isn't working something is causing it. Embedded systems are generally pretty good at throwing up the red flags; you just need to look for them (logs, stack traces, extensive use of the debugging facilities...) Use the tools at hand instead of the big red button!
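Concretely, "look for the red flags first" can be as cheap as capturing a few show commands before anyone touches the power -- a generic IOS sketch, not anything specific to OSDN's setup:

    show version
    show logging
    show stacks
    show tech-support

"show version" gives the uptime and the reason for the last reload, "show logging" dumps the buffered syslog (which a reboot wipes), "show stacks" is what you feed the stack trace analyzer, and "show tech-support" is the single long dump the TAC will ask for anyway.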
First step was getting the firmware patches onto a TFTP server near the switch (it had to be less than 3 hops from the switch; TFTP doesn't work over longer hops).
Unless this is something specific to the IOS or router, that's bullshit. I just upgraded 5 AS5248s to IOS 12.1(9) with a TFTP server that is 8 hops away. I'm not aware of any TTL issues with TFTP.
Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light leading it to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"
You mention that your network documentation is shitty -- I sure as hell hope you'll push to have it upgraded and maintained with a high degree of readability. Even complex systems do not have to be undocumented just because they're complex. Use pictures, use words. I haven't found anything in IT which cannot be explained by a combination of both. And throw in a glossary for the non-techies like yourself who are called upon to fix it. :-)
Don't get me wrong; I'm glad you're back up. But this could have been prevented -- very easily, from the sounds of it. I hope you did fire your Cisco admin; it sounds like s/he didn't have a clue and was so terrified of losing his/her job that s/he didn't ask for help. Cisco has mailing lists, tons of documentation, and there are many IRC channels to ask for help.
Re:OSDN, Audit ALL of your systems NOW. (Score:5)
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Up until recently you had no choice but to telnet to Cisco equipment. I came up with a quick solution: deny telnet from anywhere but a same-segment computer (in our case, it's our RADIUS authentication box). Now ssh to the server and telnet from there to the NAS. Problem solved. :-)
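In IOS terms that "same segment only" rule is only a few lines; a rough sketch, with 10.1.2.3 standing in for the one management host:

    access-list 10 permit host 10.1.2.3
    access-list 10 deny any log
    !
    line vty 0 4
     access-class 10 in
     login

Telnet (and any other vty login) is then refused from everywhere except that one box, and the denied attempts get logged.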
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
While I usually agree, sometimes it is necessary to do a quick check. Even with the number of blackhats out there, the chances of them doing anything significant (or anything at all) in the 2-5 minutes you have the firewall out are insignificantly small.
Re:Cisco Support (Score:2)
Re:Cisco Support (Score:5)
I can confirm this. I've been a network consultant for almost a decade, primarily as a Cisco router/switch jock. I've dealt with the TAC (Technical Assistance Center) too many times to count.
Hold times can vary, depending on time of day, but are never as bad as the stories from other companies. In most cases, you are on the phone with a real, live engineer within 5 minutes.
90% of the time, the engineer you are transferred to will be able to get your problem corrected. On the few occasions where they have not been able to help me, Cisco has moved mountains to get the right people involved. I had an issue with Serial SNA / DLSw+ encapsulation last year that was escalated to the point where the guy that wrote that portion of the code for IOS was on the phone, and was prepared to come to my client's site (true, they had purchased about $8M in hardware...).
You do, typically, have to have a Smartnet contract, but as other posters have pointed out, if the problem is not hardware related, they will generally help you straighten out your configurations even without the contract.
A lot of people like to make comparisons between Cisco and Microsoft. Anyone who has dealt with the two will be quick to dispel any similarities. Cisco is a first-rate organization, with first-rate support, and I've made a career out of working with their products.
Re:Cisco Support (Score:2)
That said, I've also had low priority cases where they don't respond for weeks; it's almost to the point at times that anything I open gets opened at Medium priority (business impact) or higher.
Re:Cisco Support (Score:2)
OTOH, I also had the experience of a TAC rep spending 2 hours on the phone with a competitor's tech support line, explaining to them why their config wasn't working. He was right, too.
A good long-term sales tactic, though: guess whose product I specified the next time.
I do wonder what will happen to the quality level at Cisco TAC with the recent layoffs, though. The first sign of impending doom at both WordPerfect and Novell was when the tech support quality suddenly headed down the tubes.
sPh
Re:Cisco Support (Score:3)
Caveat: Cisco basically does not have first level support (i.e., "Is the router plugged in?" "What's a router?") - you are supposed to have second level knowledge and have completed the first level troubleshooting before you call TAC.
But - I have been out of the office and had brand-new network techs call Cisco with a problem, and they did help out even then.
sPh
Will gladly confirm... (Score:2)
They really ARE this responsive.
Re:blocking ping, btw, is STUPID. (Score:3)
Slashdot outage - graphs and stuff (Score:3)
Go to the monitoring system page. [sysorb.com]
Click the www.slashdot.org link
Select services
This will give you some graphs showing the outage.
Consider the time (Score:4)
Well, in this case Slashdot was down. That can explain the instant response.
__
Re:Beware of departure from original statement (Score:2)
Re:Cisco Support (Score:2)
Cisco has _the_best_ customers service that I have ever seen. It is good enough, that I don't mind paying a bit more for the hardware, because I know that if it breaks, there will always be someone to help me out.
And, I don't work for cisco
Re:You know you've been using windows too long whe (Score:2)
Not always true. I used to admin JSP-based web servers. My experience is that the Java virtual machines that serve JSP pages have a way of starting to act funny. Stopping and restarting the services fixes the problem.
If I were ever building a network, I would not allow JSP to be a part of the network for this very reason.
Then again, if a JSP guru knows what can cause a JSP engine to act wonky, or how to set up a JSP engine so it is stable and doesn't need reboots, please post a follow up describing how to do this.
- Sam
Re:Anne Tomlinson? (Score:2)
Of course the original story [slashdot.org], or, I should say, some of the versions of the original story (how often can you rewrite the original and it still be the original?) mentioned "...when our qualified personnel arrived, we discovered that she wasn't actually as qualified as we had hoped. Then she quit..." which doesn't sound like someone who was already not working there anymore before the troubles started, so I assume that we're talking about 2 different people here, only one of which was identified one way or the other by sex/gender.
Quite a ways down in the responses to the aforementioned "original" story is an AC post [slashdot.org] signed Anne Tomlinson that seems to give another perspective on the events that weekend. It's a little ways down the page from another post [slashdot.org] that has some of the different versions of the original story.
Re:Not the issue (Score:2)
Re:Not the issue (Score:2)
Competent tech support! (Score:2)
Yike, I say. Yike. Competent tech support does not exist on this earth. What planet is Cisco on, and to what worthy cause can I donate money to see that humans never send a manned mission there and pollute this fascinating superior alien culture?
--G
Re:Kenny (Score:3)
Szo
Like reading old issues of the RISKS digest (Score:2)
--Jim
Re:Grr... (Score:2)
Re:hang on, i can do this... (Score:2)
I'm working on a web site to expose this travesty to the world. I'm sure everyone will be impressed with my esoteric knowledge of this classic of Japanese animation.
And in financial news... (Score:4)
"It's the first the the Slashdot effect has been a productive one", said an unnamed Cisco official, pausing briefly to dodge a large bag of cash sailing through a nearby window.
Jay (=
Re:Cisco Support (Score:2)
At an old job we had a wee Cisco 1604 router, just doing ISDN for our /24 (at the time ISDN was the only affordable thing in our area)
I had a problem with something and mailed Cisco. No more than an hour went by and I had email from a real life person in front of me telling me what to do to fix our problem.
Cisco isn't cheap, but you do get what you pay for.
grub
Re:Within a minute ??? (Score:2)
---
Often true. Maybe usually. Not always. (Score:3)
It is clear that they were out of their depth. It is clear that they didn't know what they were doing. They knew that they didn't know what they were doing. But the experts were unreachable. So they tried something that sometimes works. I really don't see how you can fault them for that. It would, of course, have been better if they had known what their choices and options were, but they didn't.
I wouldn't have either. Probably most of us wouldn't have.
Caution: Now approaching the (technological) singularity.
Re:blocking ping, btw, is STUPID. (Score:2)
Re:OSDN, Audit ALL of your systems NOW. (Score:2)
Preferably an encrypted login should be used, of course, be it ssh, telnet-ssl, or whatever.
--
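On IOS images that include the crypto feature set, the encrypted-login suggestion above is also only a few lines of config; hostname and domain below are placeholders:

    hostname gw1
    ip domain-name osdn.example
    crypto key generate rsa
    ip ssh time-out 60
    line vty 0 4
     transport input ssh

"transport input ssh" drops plain telnet on the vty lines entirely, so the password never crosses the wire in the clear.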
Re:OSDN, Audit ALL of your systems NOW. (Score:2)
Of course.
You are also protected if exploit code is run (say via a buffer overflow that changes hosts.deny).
Uh? That sounds pretty damn unlikely. The buffer overflow could just as well open a reverse channel back to the attacker. Of course, you limit the possibilities of the attackers. However, you're now already talking about running services with known vulnerabilities.
Firewalls can also protect against low-level attacks that don't attack the services/applications themselves.
That is better done at the core routers.
When properly configured, firewalls can be invaluable in logging traffic and otherwise keeping out unwanted traffic and IP spoofs -- and can do a far better job than simple packet filtering on a router.
That is better done by snort, or any other decent IDS.
I think it's pretty poor form to call someone else a dimwit when you're lacking a lot of info yourself. There's a reason that a firewall is industry-wide best practice for an Internet site or user network, and it's not because we're all dimwits
I regularly call those who think running firewalls is the be-all and end-all of security dimwits. Unplugging a firewall on a network you know isn't exactly a horrible thing to do.
A firewall is a good thing to have when you've got a network you don't have time to audit, and that doesn't have people to audit it on a regular basis. It's a good thing to have when you've got servers which you don't have any possibility of patching or upgrading -- but that need to be running some (non-vulnerable) services to the Internet.
Of course, you could do lots of these things with NAT devices (which of course aren't a perfect solution either).
Blargh, I could rant on forever.
--
Re:OSDN, Audit ALL of your systems NOW. (Score:3)
Bah, you're talking without knowing the parameters. For all you know, they could've enabled the telnet access on the outbound interface specifically for the checking/cisco rep, disabling it afterwards.
Secondly -- if I remember correctly you can have pretty damn long passwords on Cisco equipment. We do not know the length of the password, but it's highly probable that the password is 10+ characters. A brute-force attack is pretty damn difficult when you have to check 64^10 possibilities. According to my bc:
arcade@lux:~$ echo 64^10 | bc
1152921504606846976
Now, that is a pretty impressive number of queries you've got to make to exhaust that password space. To be quite frank -- I don't see the problem.
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
Oh, yes of course. If you don't have a firewall You are phooked!!
Ehh? Excuse me? Why the fsck do a properly configured serverfarm need firewalls _at all_? Please, enlighten us with your wisdom oh dimwit.
Firewalls _are not needed_ if you're not running services that _should not be running_ on servers for the internet.
--
Re:IP change & DNS TTL (Score:2)
They did change the IP back. The switchover was temporary, to get an announcement up ... and that was outside the Exodus cage. Fortunately they did have 1 (out of 3) authoritative DNS servers outside of there, so they could get people over to the announcement ... eventually, as cache TTLs expired.
It's already bad enough to have a 24-hour expiration on the A record. But you don't anticipate these outages, so 1D is fairly common practice (even longer in some places trying to reduce their DNS load). But the real mistake was putting a 24-hour expiration on the temporary IP. Basically that says "as soon as I change this, everyone who cached this temporary IP address is going to have to wait a day from when they first saw the page before they can get their /. fix (or other OSDN stuff)." What? Did someone actually think they were going to change the IP back 24 hours BEFORE the sites were back up? The temporary A record should have had a TTL of less than about 30 minutes. I'd have put in 10 minutes if it were me. But then, if I were there, I'd have also been doing the Cisco stuff and actually tested the failover configuration.
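In BIND zone-file terms the fix is just the TTL field on the temporary record; a sketch, with 192.0.2.10 standing in for the temporary address:

    ; temporary record while the real site is down -- keep the TTL short
    www     600     IN      A       192.0.2.10

With a 600-second TTL, caches forget the temporary address within ten minutes of the real one coming back, instead of serving it for another day.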
I do recommend:
These are the kinds of things system and network administrators are supposed to do. Programmers tend to hate that kind of work, so that's why there are separate job descriptions. Just because a good programmer can install and configure a server doesn't mean that just doing that is all that needs to be done. Businesses run smoothly when people know what they are supposed to do. And in the exceptional circumstances, they're doing things they don't routinely do, and it is essential to not only have those things written down, but also make sure they do work, and can be found even in a power failure.
Re:Beware of departure from original statement (Score:2)
--
Re:Really full disclosure? (Score:2)
--
SlashCrash? (Score:4)
I definitely enjoyed reading this article and I am sure that it will be bookmarked by a fair few techie-minded network admins, just in case.
Re:Cisco Support (Score:2)
The main reason that they're so prompt is that they have a global network for phone support. When you call them, your call gets transferred to a technician who has just arrived at work (i.e., if you're in the US and call at 3 a.m., you'll probably end up speaking to a technician in central or western Asia).
Re:Another great company (Score:2)
I'm reminded of an intrusion team story about one such team that faked a package from an OS vendor (letterhead, box, etc.) containing a "patch." The admins looked at the box, assumed the obvious, and installed the patch which, while fixing an actual problem, also backdoored their system.
I could see running a remote exploit to crash your box, sending you mail about it (faked, of course) and then sending you a "patch" to "fix" the exploit (while adding some of my own...).
Be careful, there are some tricky bastards around with way too much time on their hands. Check those MD5 sums...
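The check itself is a one-liner; the hard part is comparing against a checksum you got over a channel the attacker doesn't control (the filename here is hypothetical):

    $ md5sum vendor-patch-1.0.tar.gz

Compare the output against the value published in the vendor's signed announcement or on their site, not against anything that arrived in the same box as the "patch."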
Re:You know you're a cranky old grognard when... (Score:2)
Re:Microsoft Support (Score:3)
Of course, it depends greatly on who you are talking to. The platforms team does have a huge slant toward NT/2000 because that's what they support and allegedly like. Those of us in Exchange support (I'll leave it to you to figure out what part of Exch. support I'm in) handle calls where Unix servers are relays, PIX firewalls sit between systems and load balancers continually send packets off into the woods. If you *don't* know non-Microsoft stuff, aren't prepared to acknowledge that non-MS works and works well, or just can't handle the idea of public standards, you are fucked in that group.
It all comes down to who you get on the phone. If you don't like who you are dealing with, ask to speak with their manager or technical lead. Get it straightened out with them or request another support tech. You're paying for it, get what you are paying for.
(As always, my comments are my own and my employer doesn't take any responsibility for them. Like they would want to anyway.)
---
TFTP = 3 hops ? (Score:2)
I've tftp'd images to Ciscos and Ascends across the Internet (many hops) without problems. It's not smart, because if you lose your path to the server you're screwed, but it does work.
Re:Cisco Support (Score:3)
I don't normally swear, but if someone asks me if Cisco support is good, I have to reply: "Abso-fucking-lutely." They are easily the tightest organization out there, bar none. I don't think anyone -- UPS, the military, Wall Street -- runs as good an operation as they do.
And I've sat with two engineers at 1:00am through to 11:00am as they fixed my small gateway to an ISP, not a big ticket item. At one point, they did an engineer transfer, connecting me to a different part of the world, and spent thirty minutes overlapped, with the engineers working together to make sure that the new engineer knew what the first had tried. As it turned out, the firmware storage was flakey, and the config corrupted itself semi-randomly.
Years later, I watched Cisco do the exact same thing - only this time, they correctly identified that the problem wasn't them, but in some Bay routing equipment, *and* they told us the exact commands to fix it (I was an outside consultant just watching, but I believe they even offered to telnet in and fix it themselves).
So, yes. Cisco is the only brand I will buy, no matter how expensive they are. Think of the extra expense as insurance. You *may* not need it, but it sure pays for itself if you do.
--
Evan
Re:OSDN, Audit ALL of your systems NOW. (Score:2)
...unless the risk of being compromised within that short period is outweighed by the information you will gain by testing around your firewall. It is a simple trade-off.
Re:Eeep - scary moderators! (Score:4)
A quick Google search for "Anne Tomlinson" returns an orchestra conductor and someone in a retirement community.
If it was a real post, CmdrTaco probably would have ignored it. His good humored response makes me think it was a troll.
Is there any evidence that it was real?
-B
Re:Cisco Support (Score:4)
Re:Microsoft Support (Score:5)
I beg to differ.
That article details calling the 900 line, but even with support contracts, most MS tech support reps toe the company line in a distressing fashion.
"Unplug all the unix servers, that'll fix it"
"Upgrade everything to Win2k Adv Serv, that'll fix it"
"Upgrade to SQL Server (from Oracle), that'll fix it."
They seem to have no ability to distinguish which network components could be involved in a problem and are unwilling to accept that you've already localized the problem.
Case in point, there was a problem where two WinNT boxes wouldn't see each other. They both had IPs, they could both ping everything else. They were connected via a 100mbps switch.
We made sure each properly had an IP, that it could reach other machines, that the switch worked, and then swapped ports with two machines that were working just fine. We also tried isolating these two machines on their own switch, to avoid potential IP conflicts.
When we called the support number we honestly described the situation to the tech. He asked what else was on the network. We explained that it was in a different IP range, but on the same switches as a bunch of Linux machines, an OpenBSD box (the firewall for the desktop machines), and a couple of Suns (doing something for the other department, dunno what).
He then proceeded to tell us that it was the other computers, despite our telling him that we had isolated the NT boxes in question on their own switch and we still had the problem, but when we put a third computer on, both of the NT boxes could reach it just fine.
We eventually lied to him, telling him that yes, we had unplugged all the unix machines, etc. (Like we're going to just unplug our company on the say-so of a moron, and like two junior techs would have the authority to do so anyway.) So now jim-bob starts to help, by telling us that Win2k is so much better, etc., that we wouldn't have these problems with it, etc.
When we flat-out refuse to "upgrade" to fix this bug, his advice is that we format the drives and reinstall. ARGH!
We finally convince him that these machines are somewhat important and we can't just wipe them every time there's a small problem.
After over an hour with this jack-off, we hang up, problem unresolved.
We get permission from the boss to call someone in... So we look through our list of contacts and grab someone whose card says they deal with networking and Windows. Call him up. As we're describing the problem he listens quietly, grunts affirmatively when we describe how we isolated the problem, and agrees that it couldn't be any of the other machines.
Then he says, "It sounds like it's an issue with a bad route, type 'route
He said that it, whatever it was, was a very common problem where the machines basically forget how to get from A to B. That command zeroed the routing (which didn't show any bad routes) and the reboot brought it back up.
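The quoted command is cut off above; the stock NT tools for this are "route print" to inspect the table and "route -f" to flush it, so the fix was presumably something along these lines (an assumption on my part, not the original transcript):

    C:\> route print
    C:\> route -f

"route -f" clears the routing table (the reboot then rebuilds it from the interface configuration), which matches the "zeroed the routing" description above.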
Cost: a 15-minute phone consultation, $45.
Microsoft tech support was basically a sales department, staffed with the marketing rejects.
So, don't EVER believe it if someone tells you that MS supports their products. Any company whose line is "Format and reinstall" has no business calling a product "Server", let alone claiming they're in the enterprise level.
Schon, earlier in this thread, said "Rebooting doesn't solve the problem!!" I wonder what he'd say about formatting and reinstalling.
Re:Eeep - scary moderators! (Score:5)
No, we don't have a right to know. Ms. Tomlinson's departure is between her and her employer, not some tabloid exposé for a bunch of overly curious, rumor-mongering conspiracy theorists. I wouldn't be surprised if the people who blurted this out on a public forum have been seriously bitch-slapped by HR.
As a community it would be best to let the matter drop. I'm sure if you were in Anne's position you'd be severely pissed. A little perspective and some empathy would be appropriate.
Re:You know you've been using windows too long whe (Score:3)
And who's to say that the problem that's being experienced will be fixed by a reboot?
We had a server running; one of the things it did was SMB sharing. One of the drives (the one dedicated to non-critical SMB shares, in fact) died. This box was doing MUCH more than SMB -- it was also our internal DHCP and DNS server.
I was out, and one of our MS guys decided "I don't know what all these error messages mean, but I can't see my windows drives, so I'll just reboot it." Because the drive was dead, the machine wouldn't boot. He took the WHOLE DAMN DEPARTMENT OUT - nobody had DNS, and when people's Windows machines stopped working, the solution was (guess what?) REBOOT them - so THEY stopped talking to the network altogether.
Now, the kicker is that the drives in this machine were hot pluggable. If the reboot hadn't happened, I could have swapped in a new drive, restored from last night's tape backup, and people could have continued working. Instead, because the machine was rebooted the whole department was down for several hours.
The mantra stands - REBOOTING WILL NOT FIX THE PROBLEM. And if you reboot before you know what the problem is, then not only don't you know if it will help at all, but you also don't know if it will make the situation worse.
sometimes getting back online as fast as possible is more important.
That's the trap - there is no guarantee that rebooting will do this - and you might just be screwing it even worse.
Getting back online as fast as possible involves solving the problem first - REBOOTING WILL NOT FIX THE PROBLEM.
Re:You know you've been using windows too long whe (Score:4)
Even though you think you're saying the opposite of what I said, you've hit the nail squarely on the head - rebooting never fixes any problem.
It may temporarily fix the symptom, but the problem is still there.
It is possible for routers, Linux boxes, etc to crash.
Yes, it is. But if they crash, it's for a reason - perhaps there is a bug in the configuration, or firmware; or perhaps it's hardware.. but what's important is that rebooting will not actually fix the problem, all it will do is temporarily alleviate the symptom.
If the problem is with the configuration, then you fix the configuration. If there is a bug in your software, you fix that. If it's hardware, you replace the faulty hardware. If it's firmware, you upgrade the firmware (or replace the unit with a different model, from a manufacturer who actually does quality testing.)
But you do not just blindly reboot - if a reboot is required, you do it after you've discovered WHY the machine has crashed, and you've fixed it. Once again, the mantra is "Rebooting will not fix the problem."
You know you've been using windows too long when.. (Score:5)
But, says Yazz, "Since the Cisco was rebooted there were no logs to look at."
You fell into the classic "Windows" trap.. this is what I tell the Jr. tech guys here when one of the servers goes wonky: "If it doesn't work, there is a reason; something is wrong. Rebooting will not fix the problem."
They usually respond with "but I didn't know what else to do."
To which I answer "Repeat after me - REBOOTING WILL NOT FIX THE PROBLEM."
"But I didn't know what else to do."
"Then call someone who does - REBOOTING WILL NOT FIX THE PROBLEM."
Thank you. (Score:3)
BTW: Are you going to plan any redundancy/failover drills as a result of this?
Really full disclosure? (Score:5)
Was Rob just popping off at random, or was that little bit removed trying to cover
Jes' wondering...
Re:You know you've been using windows too long whe (Score:5)
You can make a case that valuable troubleshooting info is lost when systems are rebooted. I agree, but counter that all good systems should have detailed event logging. Leaving the system online and intact is the best way to root cause a bug. But, sometimes getting back online as fast as possible is more important.
Let this be a lesson..... (Score:3)
Don't slack. When you slack it bites you in the ass. Maybe not today, maybe not tomorrow, but someday, someday soon, it will.
Test your failover configs. How? By actually making them fail. During the maintenance window, power that primary router/firewall/load balancer down hard and see if the failover works. It's like testing backups, kids. You have to know they work before you need them.
Realistically develop on-call strategies. OSDN didn't really have a net ops staff of four. One had quit (why are they counted?), one was in the hospital, and two had weak "couldn't reach my cell phone" excuses. That just doesn't work in the real world. If you are on call, you are on call. The "phone too far away" and "battery fell out" excuses just don't cut it in the adult world of professional net ops. Get a satellite pager, and if you are on call, make sure it's on and near you so you can hear it.
Don't bash your employees/former employees, particularly during a heated situation. Shows no class. Besides, if you are such hot shit, grab that console and fix it. Otherwise, keep your mouth shut. Besides, who is in charge of making sure the people that are hired are qualified? Hmmm?
Document your shit. It's not that hard. Visio can do much of it for you. I'm going to break an NDA here, but the Exodus Service Agreement states that all machines and cables are to be labeled. That is so when the dude (or dudette) has to leave the NOC and enter your cage to reboot your lame box, they know what is going on. Also works well for when your net ops staff is too concerned with getting drunk or laid and your poor programmers have to go in to fix the network.
Some folks really went above and beyond, but it seems to me that the management severely dropped the ball.
Is VA really ready to abandon the hardware market for software services? One has to wonder.
Dave
been there before...
Re:Responsive techies (Score:5)
Rumour has it the conversation went a little something like this:
[Kurt] Hi, cisco tech support?
[TAC] Yes
[Kurt] this is Kurt at slashdot...
[TAC] Oh my god, it's about time you called us. You've been offline for nearly 24 hours; we're all going through withdrawals. Hang on a sec, our top techs are dying to help.
I talked to a friend in cisco TAC (Brussels) who said that they regularly lurk on
Since summer weather had come to Europe, I, personally, did not notice the outage. But I promise in the future to not have a life.
the AC
[Note to Kurt and company, make sure you return your customer satisfaction survey. Those TAC folks live and die based on keeping a very high level of sat scores. I think they need a 4.85 (on a scale of 1 to 5) just to keep their jobs within cisco, and a 4.89 to get a raise. So 5's across the board, and in the comments put a link to this
Disorganized? (Score:3)
I've seen this scenario over and over again... one guy who knows and understands the network, ten people standing around at the equipment trying various silly commands to fix it when it's down...
Here's some suggestions -- you probably already realize that 90% of your pain was avoidable, but everyone has to learn "the hard way" the first time, right?
We've got piles of printouts and documentation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer...
That's called bad documentation that no one ever reads.
Get your networking guys to document TROUBLESHOOTING techniques and to teach the programmers how the network is actually set up and why. You have plenty of talent capable of understanding how it all works there.
Get more than one way (cell phone) to reach your most important network engineers. Pop for a guaranteed delivery text pager and ask them to carry that as well as the cell phone.
Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords."
Paper. Wallet. Put them there. Better yet, PGP encrypted password escrow somewhere that anyone can get access to, and a locked cheap fire safe at the office with the public and private PGP keys on a CD-R inside -- for just this type of scenario.
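A minimal sketch of that escrow, assuming GnuPG and a made-up escrow key:

    $ gpg --encrypt --recipient netops-escrow@osdn.example device-passwords.txt
    $ ls device-passwords.txt*
    device-passwords.txt  device-passwords.txt.gpg

The .gpg file can sit anywhere people can reach it in a hurry; the private half of the escrow key lives only on the CD-R in the fire safe, so anyone who can physically open the safe can decrypt the passwords and nobody else can.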
So I asked the Cisco technician, Scott, to telnet into our switch...
Bad bad bad... telnet = bad. Good network security always goes out the window when the network's down...
So he's in the switch and he's disgusted and horrified by how we have it configured...
This is probably the most important hint during your entire outage... your network people either don't know what they're doing, or you're not ALLOWING them to do their jobs, or they're understaffed, or whatever other excuses can be made up ... your call, but don't forget this -- if Cisco's "horrified" by your configs, there's a serious issue you need to find and correct somewhere in your organization. Everything from training, to documentation, to troubleshooting procedures needs a serious walk-through.
The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...
DO FAIL-OVER TESTING. If you had done a fail-over test of this config, you'd have known it didn't work correctly during a nice scheduled time when your network engineers are available and at the equipment, instead of in the middle of the night during an outage with all of them MIA. This is so easy to avoid.
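To make the "configured correctly on both" point concrete, this is the generic shape of an HSRP pair on a single VLAN -- a sketch with made-up addresses, not OSDN's actual config:

    ! on the primary router module
    interface Vlan4
     ip address 10.10.4.2 255.255.255.0
     standby 1 ip 10.10.4.1
     standby 1 priority 110
     standby 1 preempt
    ! on the standby router module
    interface Vlan4
     ip address 10.10.4.3 255.255.255.0
     standby 1 ip 10.10.4.1
     standby 1 priority 100
     standby 1 preempt

Both modules have to carry the same VLAN interfaces and the same standby group and virtual IP; if half the VLANs exist on only one module, as described above, the surviving module has nowhere to route them no matter how cleanly HSRP fails over.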
Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. That's what the Cisco folks said, basically, and the Exodus people are no longer supporting this type of Cisco in their setups.
Nice of them to tell you. Who is the customer here again?
Put a $20/month POTS line in your cabinet for goodness sake!
That's enough... I'm appalled, but hopefully you will straighten out some things now that the site was down for an extended period. Done properly, network downtime should be a rare event, usually caused by human error, not by bad configuration.
Many outages are unavoidable; yours sounds like it was avoidable, and certain steps could have been taken to minimize the length of the outage.
Re:More Writeups Needed (Score:5)
What you said.
I did a bit of (very junior-level) sysadminning back in my day.
First thing the BOFH told me was "Buy a hard-cover notebook. Not spiral-bound. Not softcover. Write down everything you do. Feel free to doodle and write obscenities if you like. Someday you'll thank me for this".
I was a bit befuddled, and then he showed me his notebooks. Five years of dramatic fuckups and even more dramatic recoveries. His own personal "deja.google.com" (but it was 1992, and long-term USENET searching hadn't been invented yet, hell our office was using UUCP!) for everything he'd had to work out from first principles on his own.
And thus was the PFY enlightened.
(And yes, I did buy him a beer in late 1992, when something I wrote down in mid-1992 jumped off my page and saved my ass.)
Not like my experiences.. (Score:5)
While I agree that I usually get someone at cisco who knows what they're talking about, it is very rare in my experience that it happens in only a minute, although it does occasionally happen. A much more common experience is to wait on hold for 15-20 minutes, but I have waited on hold as long as an hour with them.
All of that being said, I would have to agree that cisco's TAC is probably one of the best tech support groups I've ever worked with.
--
OSDN, Audit ALL of your systems NOW. (Score:3)
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
More Writeups Needed (Score:5)
"Welcome to the HOWTO. My setup worked the first time. Why didn't yours?"
Granted, no one wants to see stuff on the 'net go down (and we're glad you're back,
Really, what Linux (and other geek subjects) need is to have a Great Book of Failure Stories -- writeups like these that detail horrible outages, downtimes, misconfigurations, security hacks, etc., so that we all can learn from others' mistakes.
Hear, hear! (Score:3)
X: "What happened to Anne?"
Y: "I don't know; all I know is that she anne-tomlinnsoned from work."
Note that this verb should have the subject of the remark used as the subject of the verb, and the organization left as the indirect object. This should be adhered to regardless of whether the subject quit, was fired, was laid off, died, disappeared, never existed, or there was a mutual decision for the subject to leave. In fact, the verb should mainly be used when the method of departure is unknown or never officially stated (or even officially acknowledged).
Also note that this verb should NOT refer to a person leaving another person, as in "Fred's now-ex-wife had tomlinsonned from him." The number of people (one or more) that are the subject should be less than the number of people who the object represents.
Continuing on, this verb should NEVER be applied in a self-referential manner, i.e.: "I anne-tomlinsonned from them." This implies that the subject either A) knows the reasons, and is just being a prick about not stating them, or B) does not know the reasons due to massive thick-headedness.
Lastly, this term should only be used to convey the sense of impenetrable mystery surrounding the departure. It would be oxymoronic to state: "Ted tomlinsonned because he was bored and wanted to leave." If the mystery surrounding the departure is penetrable, use another phrase.
anne-tomlinson, v,: to leave or be removed from a group under extremely odd, and mysterious, circumstances; especially when the actual method of departure or initiating party of departure is unknown. More especially, when the actual departure is apparently covered up or left un-acknowledged.
tenses: anne-tomlinson, anne-tomlinsons, anne-tomlinsonning, anne-tomlinsonned, had anne-tomlinsonned.
Re:Cisco Support (Score:4)
Our results were much the same. Very, very responsive people.
I have to agree with Taco, if they gave this kind of service down at the DMV, they'd be picking up passed out folks left and right.
*scoove*
It MUST have been a FREAK OF NATURE! (Score:3)
Wish the companies I deal with on a regular basis ever showed that level of skill when I need help. Well... hmm... actually, Speakeasy is generally pretty good about accepting that my problem is accurately diagnosed and figuring out what's wrong. And Viewsonic the other day was able to provide refresh-rate specs on a monitor I wanted to order within about 60 seconds of my placing the call (though they dropped the ball by not having the specs I wanted available on their web page). What is this trend of good service? It's scaring me...
Re:You know you've been using windows too long whe (Score:3)
Just because someone screwed up at your work doesn't make your mantra a universal rule. Especially when dealing with something like a router or a switch: these things are normally not meant to be user-serviceable and will take a reboot just fine (no hot-swappable drives there). You could have hit a 1 ppm problem, and rebooting just brings everything back online until statistics kick in again. A little uptime is better than none.
Sure, it won't fix anything per se, but getting things normalized enables you to start concentrating on the problems at a less hectic pace.
they're gonna have a field day (Score:4)
---
Re:You know you're a cranky old grognard when... (Score:3)
That sounds lame as hell. (Granted, though, configuring a Pipeline 50 goes right over my little bow head, much less a Cisco. So yes, I'll stipulate that I'm talking out of my ass here.)
The act of rebooting should be just another event that gets logged, NOT a synonym for "oh, and by the way, you can delete the old log file now."
IMHO log deletion should be done on a calendar basis; everything more than x days old gets purged automatically. What's Cisco's rationale for auto-deleting logs during the boot process?
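One workaround, offered only as a rough sketch (the syslog host address is made up, and the exact commands vary by IOS version), is to ship the logs off the box so a reload can't take them with it:

    ! Hypothetical IOS snippet: keep a local buffer, but also send everything to an
    ! external syslog server that handles calendar-based rotation. 192.0.2.50 is a placeholder.
    service timestamps log datetime msec
    logging buffered 64000
    logging trap informational
    logging host 192.0.2.50

On some older IOS releases the last line is just "logging 192.0.2.50". Then the retention policy lives on the syslog server (cron plus logrotate, or whatever you like), and a router reload shows up as just another logged event instead of a blank slate.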
Yeah, more details please! (Score:4)
C'mon, tell us the full story!
Eeep - scary moderators! (Score:5)
BTW, feel free to mod me down, prove my point and compound my paranoia; I've got karma to spare : )
Re:Cisco Support (Score:3)
Re:Like reading old issues of the RISKS digest (Score:3)
hang on, i can do this... (Score:5)
I. Call to Adventure
"By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem. The network operations people were paged, but did not respond."
II. Meeting the Mentor
CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."
Whoops, that's not it.
"So I called Cisco tech support."
There we go.
III. Obstacles
"Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documetation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you."
IV. Fulfilling The Quest
"He bounces the switch... copy startup-config running-config
V. Return of the Hero
"The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers."
"Tuesday was router reconfig day."
VI. Transformation of the Hero
"At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be."
"We certainly aren't going to make the same ones [ed: mistakes] again!"
Peace,
Amit
ICQ 77863057
Responsive techies (Score:4)
Re:Rebooting (Score:3)
Take a relational database, for example; there is so much that can go wrong with it. For starters, there are bugs in such complex products, and fixing them (save for PostgreSQL) is beyond your control.
But it need not even be a bug in the database code. It can be something in your network component (we chased cases for months which turned out to be a DECnet issue but were attributed to the database server), or it could be the fact that the db vendor compiles his product on multiple platforms and it's virtually impossible to test every function of a new release on every supported platform. Yes, I know that in an ideal world this should be done, but it isn't.
Even assuming it were possible to perform such tests: save for proprietary (or semi-proprietary) architectures like OpenVMS/AXP, you can have so many different hardware and network components that it's just not possible to foresee all eventualities.
After ruling out such possibilities, we're not there yet: what are the query characteristics, and how many concurrent users are doing what, and when? What front ends do they use, and how are they connected? The problem may even be caused by a component that has nothing to do with the database engine (Access front end, anyone?).
Although the fundamental cause of the problem might never be found, a reboot of the database server might fix it, and it may never occur again, since the same combination of factors occurs so rarely that it's all but impossible to reproduce.
However, the [alt-ctrl-del] attitude of younger IT folks (specifically those who grew up in a PC environment) makes me barf and indicates just how clueless a lot of those folks are. You never reboot a production IT component unless there is no other choice, or in the context of your normal maintenance cycle (memory leaks do occur in software).
Correct. (Score:3)
Either Anne is real or she isn't. If she's real, this is an internal matter that we really don't need to interfere in. If, as the "Anne" poster suggested, she quit because Taco and Hemos are hard to work with, she was within her rights and should get at least some support from a community which often says "Quit! Now!" to Ask Slashdots about PHBs.
If she's not, this is all a big waste of everyone's time, and possibly the best troll we have ever seen on slashdot. (An account by that name [slashdot.org] has a brand new uid (462836) and zero comments.) Think of the trolls you've posted - how many led to 100s of posts on other threads, conspiracy theories galore, and posts by #1 and #2? Whoever did this (if not Anne) should get mad props from the troll fans, but should not take any more of our time.
My bet is that she's not real. But in either case we should drop it and get on to more important things.
So, what was the problem...? (Score:3)
-S
Kenny (Score:5)
-S
Re:Beware of departure from original statement (Score:5)
Everything failed at about 7 a.m. Saturday. Dave was at Exodus between 8:30 and 10:30 a.m. Saturday (I didn't look at the log book when I got there). Kurt arrived shortly after that, I believe (again, I didn't look at the log book). I arrived there around 11:40 a.m. Saturday.
And yes, my battery was loose on my Nextel. It just takes a little upward pressure to loosen the battery on the i1000plus I have. The battery doesn't fall out; it just loosens enough that it loses contact and turns the phone off.
I have now taped the battery in place!
Yazz Atlas
Re:Mad props to CISCO! (Score:3)
Ehr... 4 T1s = 4 x 1.544 Mbit/s (= 6.176 Mbit/s... der).
1 OC-3 = 155.52 Mbit/s.
Not really similar, eh?
Grr... (Score:5)
After having been modded down next to the goatse links, somebody please explain to me how the hell we're supposed to discuss the decidedly strange disappearance (and subsequent reappearance) of this story on the site without getting modded as "offtopic"?
Just where, exactly, are we to discuss this little point? For example, why did this story disappear? Was it technical? Was it editorial?
For a group that is so damned keen on openness and truth, it strikes me as somewhat ironic that several dozen mod points have been used to effectively suppress this part of the thread.
I want to know what happened. Others do too. If you can't give us a decent place on Slashdot to discuss this issue, then don't mod us down as offtopic!
Re:Grr... (Score:5)
You'll understand my consternation, though, upon seeing my (admittedly offtopic) post on the Shared Source article regarding the disappearance of this article modded down three points to -1 in the course of roughly one minute, and the seemingly similar fate of a good many other posts like it. Also worth noting is the fact that this article touches on what many would consider a rather sensitive issue with the OSDN and /. crew right now. I don't like conspiracy theories much, but I'll be damned if the situation didn't seem, well, rather odd.
Might I suggest, though, that once stories are actually posted to the front page, they remain as is, even if the order of presentation is not the most desirable? Consistency is key, and having articles disappearing from the front page is not terribly consistent.
How many times do we have to say it... (Score:5)
No matter how FUBAR'd your router/switch/firewall configuration is, it's still no serious obstacle to crackers, Robin.
Re:More Writeups Needed (Score:5)
I discovered the RISKS digest when reading about software engineering, and it has certainly helped me think about failures and recovery when designing and building systems.
There is also the underlying thesis in the article about how complexity, whether in a bungled redundant network connection or just a large, poorly documented, poorly tested, and poorly configured system, is your enemy in building reliable systems. A lot of systems were built like Slashdot during the dot-com IPO craze; I wonder how many of us rely on such poorly built systems?
Building complex, reliable networks is hard, and expensive: about 3 times what your estimate is, which is about 2 times what your boss expects to spend.
New slashdot sql ... (Score:3)
Is there someone else out there hosting a site where we can have a non-biased discussion?
BROWSE AT -1. Check out how many posts have gone straight in at -1 (and this one will too, I betcha...)
Re:Eeep - scary moderators! (Score:4)
Hey, try browsing at -1 nested - seems like everyone who's questioned the story about the woman who "quit" (Anne Tomlinson?) has suddenly been modded down. I'm not usually one for conspiracy theories, but is this surprising anyone else? I think that the question couldn't really be more on topic, and they're hardly flamebaits or trolls - what's up?
Yes, I'd like to know too. I didn't see the original "Slashdot Back Online" announcement, but I did see Maswan's listing of the 3 different versions of it [slashdot.org], with all reference to the "she wasn't actually qualified" girl removed. And of course, the message from Anne Tomlinson [slashdot.org] decrying her treatment at the hands of Jeff and Rob. And an astonishingly rare post from CmdrTaco [slashdot.org] (his only post in the last several weeks) dismissing her as a troll (and being of course sycophantically modded up to 5).
So I, and no doubt many other loyal Slashdot readers, would like to know - what really happened? Who is Anne Tomlinson? Why did she quit? Why has all reference to her been purged from the site? Why is everyone who asks about her being modded down to -1 so quickly that it is obviously editors doing it?
We have a right to know.
Beware of departure from original statement (Score:5)
Where does this mysterious woman fit into the story above?
hrmm... (Score:3)
-
sean
Ouch! (Or When a Redundant System Isn't) (Score:4)
I remember when I started out in computer networking (and it doesn't seem like it was that long ago), one of the other technical members of our team told me something that I haven't forgotten: redundancy in a system is necessary not only in the hardware and software of that system, but also in the resources that are used to keep that system running (that includes, of course, human resources, as well as power, HVAC, and so on).
Too often, the human part of the redundancy equation isn't totally factored in. When you don't put all of the human factors into the redundancy equation, you have a redundant system that isn't really redundant.
Of course, it helps if you have a vendor that will work with you (and those of you who remember working with Novell servers in "the old days" know what I'm talking about, too).