
Worldcom's Frame Relay Down
Jim Trocki writes "MCI/Worldcom's frame relay network has been hosed for at least 8 days now. Read the story.
This is the recorded message that is heard on Worldcom's tech support line:
'In accordance with our plan to repair the instability in one of our frame
relay network platforms, we have taken our domestic frame relay platform
out of service for a 24 hour period, from noon Saturday to noon Sunday
Eastern standard time. As a result, your frame relay service will not
be available for traffic.' " Here is the MCI Worldcom web page on the situation. The news.com article says that this outage might cut into their profits. It seems this is quite a severe outage...
How nice. (Score:1)
Re:More on the story...MCI to lose CBOT as Custome (Score:1)
Single attached networks (Score:1)
Re:How nice. (Score:1)
What nobody's said yet (Score:1)
More on the story...MCI to lose CBOT as Customer (Score:2)
http://www.chicagotribune.com/tools/search/resu
...hmmm, wonder what MCI is using for op systems behind the fiber (especially after the holes noted in a certain OS from the Northwestern US by the folks at NETWiz (see next article on the Internet Security Audit))?
It makes me happy. (Score:3)
Stories like these give me someone to point at, saying "See? This computer stuff is goddamn hard. If a multibillion dollar corporation specializing in networking can't bring their network up, how the hell can I do anything on $58K per year? Gimme a raise!"
CBOT is trying to deflect blame to MCI... (Score:2)
Granted MCI has screwed up badly - whoever does their change control should be fired over this - but the Chicago Board of Trade deserves the brunt of their customers' anger.
You don't run a critical network without backup. Their slow links should have ISDN backups, their fast links should have dedicated redundant connections. That's just common sense.
Re:What could take this down... (Score:2)
As I understand it, it's their frame relay that has gone down. That would most certainly be proprietary software/hardware. According to the article, they're using Lucent.
Re:Lucent the Ostrich (Score:2)
Not having a backout plan is unforgivable when that many people and businesses depend on the service. Being truly prepared for a full backout is difficult, but is doable.
To further the problem, they apparently don't have much in the way of failover preparation in place, or an established mitigation procedure (or if they do, they stupidly switched everything to the new software at once).
The final problem at MCI has nothing to do with technical issues. Many are indicating that their problems would be much less severe if MCI would be more forthcoming with information, and would make a public statement. The only thing stopping them from doing that is arrogance, execs too busy packing their golden parachutes, and a determination to spin control themselves into the ground.
Re:Single attached networks (Score:2)
For large business, I agree, not dual homing is a bad idea.
The real issue is small business, which is less likely to survive the outage in the first place, and cannot afford to dual home. Unfortunately, connectivity is still WAY more expensive than it has any right to be. For small business, doubling that cost is out of the question.
Re:Single attached networks (Score:2)
I don't advise any business NOT to dual home if it's at all possible. In fact, I would advise businesses to put their servers in a colo facility which is dual homed.
What I'm saying is that some businesses CANNOT afford to do that and provide decent connectivity to their office as well. I can understand their temptation to go with a single provider and hope for the best.
Re:What could take this down... (Score:2)
I can see that. According to another poster, MCI initially told him they had messed up their routing tables.
Reports are that MCI has had some trouble for months. To me, that makes the cause of the outage an open question.
Re:ISPs endangered (Score:2)
Many small ISPs are colos that lease a dial-up bank from their provider. Probably, the dial-up is linked to the colo servers through a Worldcom frame relay. Thus, the email is down. It's a pretty common setup these days.
Re:Single attached networks (Score:2)
What a corporation CANNOT pay for and what they WILL NOT pay for are two different things.
CORPORATION? Sure, odds are they can and should afford it. I am talking about SMALL operations. Think sole proprietors (that is, Mom 'n Pop). $1200 could represent a significant portion of their income while they are getting established. It could make the difference between surviving long enough to become profitable and closing the doors.
Those are the people I feel for in this situation, especially since they are probably at the bottom of MCI's priority queue.
I agree that this incident will re-define what corporate America will pay for wrt redundancy. In a way, that is a shame, since it essentially rewards the network providers for being unreliable.
Dual-homing can be harder than you think (Score:1)
In a former life, I was a co-op student at Western Union (Anyone remember them? They used to actually transfer data as well as money.), and I was involved in leasing data lines from AT&T and the local telcos wherever our customers needed connections.
Some of the larger outfits were indeed interested in redundancy and paid a premium for, say, two links from NYC to SF, one by way of Chicago, and another through Houston. We frequently had trouble verifying that we were really getting independent routes.
The telcos bundled connectivity (back when a single voice line was lotsa data links [after they'd been digitized]) in a hierarchy so deep that it took days or weeks to verify that every link in the route used facilities physically separate from every other.
Why? 'Cause to Ma Bell, bandwidth is fungible. Got noise on link 37A from Manhattan to Albany? Take it out of service for maintenance, and swap in some spare bps from Manhattan to Jersey City to Albany. It's all the same....
Until a manhole floods in Jersey City, and *oops* it turns out your route to Chicago (formerly via Albany) is now cheek-by-jowl with the wire to Houston, and you're dead!
Among other things, we supposedly rented circuits with huge Do Not Reroute tags hanging all over them, but on occasion someone overlooked it, or worse, the facility two levels up the hierarchy got rerouted and our link went along for the ride unknowingly.
I wish them well---it can only be much worse with fiber optics these days.
Diversity is Power (Score:1)
I bet this applies more to companies that value economy of scale over reliability. ``We can save 0.3% by buying 10k Grace L. Furgeson routers? Great! ... They don't interoperate with any others? That's OK, we'll buy 20k and use them exclusively.''
So now when the GLF equipment shows a bug under certain wildly-unlikely circumstances, you're sincerely screwed. Much better to insist on at least two (and if you're serious, three or four) vendors' equipment, interoperating to an open standard, throughout your network. That way, as with genetic diversity in crops, livestock, and humans, you're much better able to withstand climate variation and new diseases. Half your network may be down with the bug, but you still have significant bandwidth running.
(Of course, this doesn't inoculate you against errors in the protocol, but that's better debugged than the equipment implementing it.)
Re:How nice. (Score:3)
Re:Lucent the Ostrich (Score:1)
We'll survive it. We have a connection to another local provider (who uses UUNet as well but doesn't seem to be experiencing the same problems) and a PPP T1 link to AT&T. A simple route-map to prepend our AS to the UUNet BGP announcements and voila, AT&T handles most of the traffic.
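(For the curious, the prepend trick looks roughly like this in Cisco IOS-style configuration. This is only an illustrative sketch; the AS numbers and neighbor address below are invented, not the poster's real setup.)

    ! hypothetical example: make the path through UUNet look longer so
    ! inbound traffic prefers the AT&T link instead
    route-map PREPEND-UUNET permit 10
     set as-path prepend 65001 65001 65001
    !
    router bgp 65001
     neighbor 192.0.2.1 remote-as 701
     neighbor 192.0.2.1 route-map PREPEND-UUNET out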
As far as a secondary feed, I don't know what you're talking about. If you have a frame-relay circuit with MCI Worldcom, a secondary frame-relay circuit won't solve the problems. We'll expect compensation for violation of our service level agreement and CIR and be done with it.
This is a big deal. Someone will lose their job or face severe disciplinary action over it but they'll figure it out and things will be back to normal. Besides this, we've had *EXCELLENT* service for several years.
Re:Not quite 8 days (Score:1)
Re:ISPs endangered (Score:1)
Re:Here's a thought (Score:1)
(software--overimbibing instructions)
Re:BellCore (Score:1)
ISPs endangered (Score:2)
There's a story on C/NET, "ISPs say MCI outage could kill businesses" [news.com], that's more than a little bit scary. Does MCI have their own ISP business? One that would just as soon see the little guys dry up and blow away? Do they have any corporate buddies that do?
Here's a thought (Score:3)
Lucent? (Score:1)
Software upgrade = better surveillance ... (Score:1)
Re:What could take this down... (Score:2)
Re:How nice. (Score:2)
I think this applies more to some companies than others. We had great difficulty with MCI WorldCom specifically, with something as simple as turning on a couple of T1's that they had 45 days' notice of the install date for.
It's not all big companies, though. We have some services from Frontier (itself the result of many mergers), and they generally display a much higher level of competence.
ISP's affected? (Score:2)
Valued Customers:
At 11:15 pm 8/13/99 WorldCom, our Global Service Provider, notified our Network Operations Center of the need to perform emergency maintenance on their Frame Relay network beginning at 12 Noon (EDT) Saturday 8-14-99 and finishing at approximately 12 Noon (EDT) Sunday 8-15-99.
During the course of this emergency maintenance, you may or may not experience the following: congestion over the network, latency and potentially, loss of connectivity. The work being performed by WorldCom necessitates the complete shutdown of all frame relay switches within the WorldCom network, and a controlled, one by one, reinstatement of each frame relay switch back onto the network.
We have been assured by WorldCom that every effort will be made to reduce the impact to our network and to resolve the issue necessitating the emergency maintenance as expediently as possible.
We will notify you once we have received confirmation from WorldCom that all work has been completed.
Thank you for your patience and continued business,
BellSouth.net
Re:Here's a thought (Score:1)
I've got no hard facts to back those statements up, just experience
Erik
Has it ever occurred to you that God might be a committee?
Re:ISPs endangered (Score:2)
Seriously though...my question is...does MCI WorldCom use the same service level agreements on their frame cloud that uu.net does on their internet access? If so...this could be a *SIGNIFICANT* hit to MCI WorldCom's pocketbook.
For those of you who don't know...UU.Net's SLA's basically say that for every hour of downtime, you get a *day* of credit on your circuit. So...for 8 days of downtime...that's over 6 months of service for free!
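A quick back-of-the-envelope check of that claim, assuming the credit terms really are one day per hour of downtime as described above (a tiny Python sketch, not anything from the actual SLA):

    outage_hours = 8 * 24           # 8 days of downtime = 192 hours
    credit_days = outage_hours * 1  # 1 day of credit per hour of downtime
    print(credit_days / 30.0)       # roughly 6.4 months of free service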
Ouch.
Jeff
Re:Text of CBT letter to MCI CEO (Score:1)
While MCI and apparently Lucent share a major percentage of the blame in this particular situation, someone at CBOT screwed up big time. What next? Someone told us we did not need a UPS for our systems?
Does anyone else remember the flak eBay took for apparently running their business without redundancies in their infrastructure?
Re:Hmm, Are these ATM lines? (Score:1)
Leads me to wonder why it wasn't handled a bit better, but them's the breaks. We had flaky access last weekend and off and on through the week. Completely dead yesterday at noon but up early this morning.
I just pushed our machines thru a proxy server on a dialup during the outages to avoid complete loss of connectivity. One good reason to have lots of small companies to choose from for your service.
Re:Single attached networks (Score:1)
1. If you pay $1,200 US for service, you get service. If your service is free, then you take your chances. If I go to a restaurant and pay money and don't get service, it's not my fault. It's not the little ISP's fault if they don't do redundancy. I think the little ISPs should beat themselves up; they don't need us to do it. The bottom line is that MCI went down, not the little guy, and MCI should pay for everything, not the little guy.
2. "any business" is not a corporation.
3. Your logic should also apply to MCI; perhaps they should have redundancy built into their network so that this doesn't happen.
So... (Score:1)
Lucent the Ostrich (Score:1)
The real losers here are all the small ISP's who rely (possibly unknowingly) on the MCI backbone since it's wholesaled to them by other companies. 8 days of downtime will put some of these guys out of business. They don't have the cash flow to refund all their customers for lost service.
The wholesalers owe it to their ISP customers to have a secondary feed (hopefully using non-MCI, non-Lucent equipment) to prevent such a disaster as this.
Re:Lucent? (Score:1)
Oddly enough, the engineer we usually deal with at UU has been unavailable the past 2-3 weeks because they had him working on a "big project." He didn't even show up at work while he was on that project. I wonder if this was it?
Re:What could take this down... (Score:1)
I've got a friend who works at Lucent, and when I told him about this, his comment was "All big telephone companies blame us when their networks go down. We waste more time proving that our products weren't the cause of outages than I care to think about."
Which makes some sense. If you were in charge of damage control at MCI, would you want to say, "Yes, this was entirely our fault, and we're incompetent monkeys for letting it go on so long"?
How it happened. / Why is it not fixed? (Score:5)
I do not speak for MCI/WorldCom or Lucent, I do not work for MCI/WorldCom or Lucent, I have no affiliation with MCI/WorldCom or Lucent, and I do not have any sort of business relationship with MCI/WorldCom or Lucent.
That said to cover my ass, here's what appears to have happened.
MCI/WorldCom has had capacity issues since mid-97. When it was just MCI, they stopped selling DS3's for a period of time a few years back because they simply didn't have the capacity. MCI has long had capacity issues, and as a direct result, has typically run their equipment at or near capacity. What appears to have happened is a cascade failure. I'm going to try and put this into words, but it's easier with pictures. Trust me.
What happens in a cascade failure? A network running at or near capacity has a single core router fail for some reason; in MCI/WorldCom's case, the failure was due to software. The load from that core is quickly distributed to the remaining core routers. Those routers, already at or near capacity, almost immediately give way under the extra load, failing for various reasons triggered by the overload. As each router fails down the line, the load on the survivors climbs faster and faster, cascading into a full network outage.
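If it helps to see the arithmetic, here's a toy Python sketch of the effect (made-up numbers, nothing to do with MCI's real topology): a pool of core routers shares the total load, one dies, and its share is redistributed until either the survivors absorb it or they tip over one after another.

    def simulate_cascade(num_routers=10, capacity=100.0, utilization=0.95):
        total_load = num_routers * capacity * utilization  # network running near capacity
        alive = num_routers - 1                            # one core router fails
        failures = 1
        while alive > 0:
            per_router = total_load / alive                # failed router's load is redistributed
            if per_router <= capacity:
                return f"stable with {alive} routers left after {failures} failure(s)"
            alive -= 1                                     # another router buckles under the excess
            failures += 1
        return "full outage: every router overloaded in turn"

    print(simulate_cascade())                    # at 95% utilization the whole pool collapses
    print(simulate_cascade(utilization=0.80))    # with real headroom, the failure is absorbed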
Now, why isn't it fixed? Recovering from a cascade failure is extremely difficult. I'm speaking from experience: I've been through a server cascade failure, and recovery was not easy. A network is even harder.
To recover from a cascade failure, load has to be taken out of the picture for a period of time so that things can be brought back online with nothing pushing on them. That's the reason for the 24 hour planned outage, I believe; not working for MCI/WorldCom or Lucent, I can't confirm this. Once the load is eliminated, each router all the way down the line has to be fully reset, restored to its original configuration, reconfigured, and then brought back online one by one, with *NO* load on the network. If there is load on the network equivalent to what there was when it went down, that router will immediately fail again. After each router is brought back online and tested, each interface must be brought back up, one at a time, to make sure the load does not cascade out of control again. Once this is done, stability can be assumed to be restored, assuming no more interfaces or connections are added.
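In rough pseudocode, the restoration order described above looks something like this (stand-in Router/Interface classes, purely illustrative; nobody drives real frame relay switches from a Python script):

    class Interface:
        def __init__(self, name):
            self.name = name
            self.up = False

    class Router:
        def __init__(self, name, if_count=4):
            self.name = name
            self.interfaces = [Interface(f"{name}-if{i}") for i in range(if_count)]

        def reset_and_restore(self):
            for i in self.interfaces:
                i.up = False          # keep all load off the box while it's rebuilt
            print(f"{self.name}: reset, original config restored, verified with zero load")

    def staged_recovery(routers):
        # phase 1: rebuild and verify every router before any traffic returns
        for r in routers:
            r.reset_and_restore()
        # phase 2: reintroduce load one interface at a time, watching for a new cascade
        for r in routers:
            for i in r.interfaces:
                i.up = True
                print(f"{r.name}: {i.name} up, load holding")

    staged_recovery([Router("core1"), Router("core2")])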
Why MCI has taken so long to take this action, I don't know. Were I running the network, that would have been the first action upon noticing the cascade: shut down all interfaces and cut off the load so the cascade can be halted before the entire network is affected. Immediately notify all customers that, flat out, "a router failed due to a software upgrade, we don't know why, and we had to shut down all interfaces for a period of time to prevent the failure of the entire network. We don't know when we'll be able to get everyone back up." Furthermore, I'd do everything I could to get affected customers back online, and to find a way to get the customers attached to the failed router set up somewhere else, so as to get the network back up and be able to troubleshoot the failed router as quickly and cleanly as possible.
But like I said, I don't work for MCI/WorldCom or Lucent. I can't guarantee any of this information to be true. To be quite honest, I'm glad I don't work for either company. They have totally mishandled this whole situation, they're going to lose a lot of customers, and I believe they deserve it. You don't get and keep customers by keeping them in the dark and being very vague. Hell, if I call the Cleveland Verio NOC, they'll tell me exactly what happened when the T1 at work goes down: either the 7513 had a failed RSP, or both power supplies failed, etc. (And still my coworkers wonder why I hate Verio.. maybe because they're telling me these kinds of things weekly?) MCI/WorldCom and Lucent have turned this into a disaster of proportions that never should have happened. Oh well. Their loss, others' gains.
Welcome to the Internet in this day and age, where information comes at a premium, and customer service is something of the past. Sad but true.
-RISCy Business | Rabid System Administrator and BOFH
Can't fix software, eh? (Score:2)
Re:What is it? (Score:2)
Looking for a job? (Score:1)
Field Service Engineers
You'll install, test, and repair circuits and equipment at the customer premises. You'll satisfy customers with face-to-face interaction. You'll participate in performance improvement efforts, and be responsible for an MCI WorldCom vehicle. You'll perform dispatch duties after hours when necessary, and participate in a call-out rotation. To qualify, you'll need an AS degree in a technical field and 1 year of field service or 2 years of central office experience; knowledge of personal computer software and hardware operations will be beneficial. We have this position available in the Northern and Western suburbs as well as downtown Chicago.
Y2K related? (Score:1)
Re:How it happened. / Why is it not fixed? (Score:1)
Re:ISPs endangered (Score:1)
I agree with you, though, on the mail bit, unless they're not hosting the mail server, just sending smtp upstream. That doesn't sound right, though.
--bdj
Re:What nobody's said yet (Score:1)
--bdj
Re:Single attached networks (Score:1)
Re:Here's a thought (Score:1)
-NavisCore
Re:Lucent the Ostrich (Score:1)
They've been quiet because it's not their problem, this one's MCI's. Read my previous post.
Ask the web... (Score:2)
ISP's and Netcom suffer too (Score:1)
JediLuke
Text of CBT letter to MCI CEO (Score:2)
By Bridge News
Chicago--Aug 13--On the heels of Thursday's power outage in downtown
Chicago, which forced an early shutdown at the Chicago Board of Trade, the
exchange was forced to suspend trading again today on its Project A system. CBT
President Thomas Donovan sent a letter to MCI WorldCom CEO Bernard Ebbers,
blasting the company for its part in a string of other disruptions that have
plagued the system. MCI WorldCom is the exchange's network provider and has been
unable to cope with the crises to the exchange's satisfaction.
* * *
Donovan said today's shutdown and others in the past few weeks were a direct
result of MCI WorldCom's "catastrophic service disruptions," which have deprived
large segments of the CBT's constituents access to Project A through their
trading terminals on the system's wide-area network.
"All told, our Project A markets have been down over 60% of the time since
Project A's scheduled Thursday evening trading session last week, exposing our
members and their customers to market risk and depriving them of significant
trading and revenue opportunities," he said. "The CBOT has also experienced a
sizable loss of transaction fee revenues."
MCI WorldCom has "tarnished the CBOT's 151-year reputation as a provider of
dependable and reliable market facilities," said Donovan, adding that the
problems put the exchange in the hot seat with its federal regulatory body, the
Commodity Futures Trading Commission.
He said MCI WorldCom led the CBT to believe it would not need a contingency
plan, but the exchange would now be forced to implement one beginning with the
Project A session that begins at 1800 CT Sunday.
Under the plan, many exchange members will have to move or duplicate their
Project A operations and staffing to back up locations within the building,
entailing added costs and hardships.
Last week, Project A suffered a shutdown after MCI began to upgrade its
communications network and an outage occurred at a switching center. The company
provided assurances it would try harder to restore customer confidence.
"As a result of MCI WorldCom's failure to deliver on their promises to me
early last week, the CBOT is pursuing all available remedies," Donovan said.
He said the exchange had lost all confidence in MCI WorldCom's ability to
provide reliable service and was awaiting the company's immediate response as to
how it would remedy the situation. End
Bridge News, Tel: (312) 454-3468
Send comments to Internet address:futures@bridge.com
[symbols:US;WCOM]
Aug-13-1999 17:26 GMT
Source [B] BridgeNews Global Markets
Categories:
COM/GRAIN COM/SOY COM/LIVE CAP/FOREX CAP/CREDIT CAP/INDEX COM/AGRI
COM/LUMBER COM/ENERGY CAP/STOCKS
Re:Here's a thought (Score:1)
Re:Lucent the Ostrich (Score:2)
It's usually not practical to really test complex systems under anything that approaches the real world. MCI WorldCom can't maintain a test environment that comes anywhere near mirroring the significant portion of the Internet they service. Anything less than a test under real-world loads will not be representative of what will happen when you put it into service.
And remember, testing only demonstrates the presence of defects, not their absence.
Who's to say that Lucent is at fault here? I would guess that the same equipment is running outside of MCI and we're not hearing about problems there. I don't actually know the situation with regard to the Lucent hardware/software. It may be that this is something that only MCI has, or something that only MCI has put under such loads.
People need to get some perspective. With the growth of Internet and bandwidth demands in general, combined with the cut-throat cost competition environment for these carriers, it's really surprising to me that we don't have a lot more failures like this one. Get over it.
Re:What nobody's said yet (Score:1)
Re:What could take this down... (Score:1)
Re:How it happened. / Why is it not fixed? (Score:1)
http://www.cmpco.com/aboutCMP/powersystem/black
Man Oh Man (Score:1)
I don't even think about it when I logon
But then, when the logo keeps churning
And something smells like it's burning
I ask the question,
Where's my connection?