Ask Slashdot: How Transparent Should Companies Be When Operational Technology Failures Happen?
New submitter supernova87a writes: Last week, Southwest Airlines had an epic crash of IT systems across their entire business when "a router failure caused the airlines' systems to crash [...] and all backups failed, causing flight delays and cancellations nationwide and costing the company probably $10 million in lost bookings alone." Huge numbers of passengers, crew, and airplanes were stranded as not only reservations systems, but scheduling, dispatch, and other critical operational systems had to be rebooted over the course of 12 hours. Passenger delays directly attributable to this incident continued to trickle down all the way from Wednesday to Sunday as the airline recovered. Aside from the technical issues of what happened, what should a public-facing company's obligation be to discuss what happened in full detail? Would publicly talking about the sequence of events before and after failure help restore faith in their operations? Perhaps not aiming for Google's level of admirable disclosure (as in this 18-minute cloud computing outage where a full post-mortem was given), should companies aim to discuss more openly what happened and how they recovered from system failures?
Router Failure? (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
A router losing data? Ya, right.
"Teacher, I couldn't do my homework last night, because our dog ate the router."
"And the cat ate my gym suit."
Re: Router Failure? (Score:1)
"Teacher, I couldn't do my homework last night, because our dog ate the router."
"And the cat ate my gym suit."
Pics from your pets' veterinarian or it didn't happen.
Re: (Score:2)
Re: (Score:1)
So, they're insulting themselves for assuming their customers are largely clueless?
Who Cares? Let the Market Decide (Score:2)
Re: (Score:2)
These people are just incompetent and should be fired immediately. Uptime is a solved problem if you engineer well.
Re: (Score:2)
I hope they maintain their aircraft better than their computer systems and terminals. It sure doesn't inspire confidence.
These people are just incompetent and should be fired immediately. Uptime is a solved problem if you engineer well.
You can be relatively sure they do the absolute bare minimum like every company does with their "cost centers".
People have been convinced they want cheap everything so the MBAs turn the screws down really good.
Re: (Score:2)
People have been convinced they want cheap everything so the MBAs turn the screws down really good.
Funny how "cheap" never seems to apply to their salaries and bonuses though.
Re: (Score:2)
Funny how "cheap" never seems to apply to their salaries and bonuses though.
Of course not! They are adding value and if they weren't sufficiently compensated they would take their talent elsewhere!
Re: (Score:2)
Funny how "cheap" never seems to apply to their salaries and bonuses though.
You might want to talk to some SWA employees about that. SWA is notorious for low pay and stingy benefits. That is one way that they keep their fares low.
Re: (Score:2)
Well how many institutions actually have a proper IT Infrastructure?
A company having to embarrassingly show its inadequacies when a problem affects customers should be public, because if they place so little value on their IT systems, should we trust them with their data? Also, being self-serving: That embarrassment will make sure they hire more staff and put more money in IT funding.
Re:Router Failure? (Score:5, Insightful)
You haven't worked in enterprise IT for long, have you? An embarrassment like this will make them flog their existing staff harder, insist on more metrics to measure performance, more boxes on the audit form to tick, more mandatory unpaid overtime. But little chance they'll actually spend more money on the IT cost center.
Re: (Score:2)
That embarrassment will make sure they hire more staff and put more money in IT funding.
You haven't worked in enterprise IT for long, have you? An embarrassment like this will make them flog their existing staff harder, insist on more metrics to measure performance, more boxes on the audit form to tick, more mandatory unpaid overtime. But little chance they'll actually spend more money on the IT cost center.
Sadly true in most cases.
In most organizations whose businesses are not IT related, the only time anyone powerful enough to do anything about it cares about IT is when it breaks.
When things are working, what do we need more IT expenditures for?
When things are not working, why did we spend what we did?
I wish I had never gotten into this "career".
Re: (Score:2)
That embarrassment will make sure they hire more staff and put more money in IT funding.
You haven't worked in enterprise IT for long, have you? An embarrassment like this will make them flog their existing staff harder, insist on more metrics to measure performance, more boxes on the audit form to tick, more mandatory unpaid overtime. But little chance they'll actually spend more money on the IT cost center.
Depends. What situations like this do is put a pretty firm dollar amount on the failures that IT asks for $X to mitigate/prevent. That way, next time they ask for $400k for something to avoid a $2M problem, they can ask in a language that upper management understands, and have memorable evidence to back them up.
Sad that management won't trust the expensive experts they hire, but sometimes it takes an expensive lesson for them to learn (just sucks that the customers usually get screwed in the process).
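The $400k-versus-$2M framing above is an expected-cost argument, and it can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, and it assumes (optimistically) that the mitigation prevents the outage entirely and that the outage cost is known in advance.

```python
def break_even_probability(mitigation_cost: float, outage_cost: float) -> float:
    """Outage probability above which paying for the mitigation is cheaper on average."""
    return mitigation_cost / outage_cost


def expected_savings(mitigation_cost: float, outage_cost: float,
                     outage_probability: float) -> float:
    """Expected money saved by mitigating (negative means the mitigation costs more).

    Assumes the mitigation fully prevents the outage, which real projects rarely guarantee.
    """
    return outage_probability * outage_cost - mitigation_cost


# Figures from the comment above: $400k to avoid a $2M problem.
p = break_even_probability(400_000, 2_000_000)
print(f"Mitigation pays off if the outage is more than {p:.0%} likely")
print(f"At 50% likelihood, expected savings: ${expected_savings(400_000, 2_000_000, 0.5):,.0f}")
```

Under those assumptions the spend is justified once the chance of the outage exceeds one in five over the period being budgeted for.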
Re: (Score:1)
Re: (Score:1)
More interestingly (IMO), given that such a path is chosen, how could we appropriately encourage truthful and full disclosure? i.e. what's in it for the business?
"You are required to under th
Re: (Score:3)
Only those that are required to by some laws, regulations, or an external body, such as financial or health care institutions. Everybody else cheats on infrastructure and recovery equipment. They figure the odds of 2 or more related apps going down and then combine recovery/fail over systems and equipment into one. The cell phone industry is one of the worst, they barely have enough equipment to handle 50% of their KNOWN customer load and just figure that not everyone is going to try and make a call at the
Re: (Score:2)
The overhead to make a cell phone connection is much higher, and requires a point to point connection, thus the network will fail under that load. SMS or text messages don't require the same direct connection but can be bounced about and their lifespan is much greater, thus they can be made when the cell network is unavailable for normal calls. The downside is that text messages can go stale and the protocol does not include reception acknowledgement. Thus you can send 3 messages and get only part of them
Re: (Score:2)
Shouldn't, but could.
They could be running a converged network infrastructure with storage and networking fabrics meshed and a run-amok router starts blasting out broken routes and it cascades into storage access problems and crashes compute nodes that lose their storage, resulting in some borked databases and crashed apps.
I'd guess it was designed to not do that and we don't know if it was a config error, some HA feature that didn't work, some other bug or what.
Re: (Score:1)
Router failure? No.
Windows 10 upgrades... Don't worry. That all ends today
Re: (Score:1)
Isn't it the free upgrade which ends today? Surely the harassment (from the intimidating blue re-spawning rectangle) will need to ramp up significantly to continue to drive adoption?
Re: (Score:2)
Router failures shouldn't cause loss of data in any appreciable amount. Enterprise level organizations should have automatic failover routers in place. This was far more than a simple router failure...so the real question should be: should companies be allowed to lie to their customers about major technical issues?
Why is that so hard to believe? I can see how a core router failure could lead to data loss. Router failed, backup router didn't work (if you don't do failover testing, you don't know that your backup is really ready to take over the load: "oh oops, the firmware on the fiber interface card on the secondary crashes under heavy load"), split-brain leads some systems to fail over to secondary, now you've got transactions hitting primary and secondary databases concurrently, possibly with no way to reconcile th
Re: (Score:2)
Wow. You've never done this for a living, right?
Network failures in such a complex, distributed system cause unexpected problems. 'Router' should be thought of in this scenario as 'data flow device', and of course data is at risk. Transaction rollbacks, session timeouts, and more cause problems that become data loss events.
Not that SWA is without blame here. At work we had a server failure that impacted thousands of virtual machines. What was a storage failure became a corruption failure, and ultimat
Re: Router Failure? (Score:2)
SWA is just an acronym. Luv refers to an old motto / ad campaign, notice they have a heart shape in most of their imagery.
Their IATA code is WN, possibly from an old parent airline...
Re: (Score:2)
"SWA is just an acronym."
Found online years ago:
A.c.r.o.n.y.m. - A contrived reduction of nomenclature yielding mnemonics.
Re: (Score:1)
why can't people accept that things happen? (Score:2)
Re: (Score:2)
the only thing public pressure does is cause the company to spend more money in redundant hardware which mostly sits unused and raises prices
My redundant hardware is constantly in use and I have nowhere near the budget of these big boys. Redundant doesn't always mean active/passive. Routers are especially easy to run active/active, hell that's the way the Internet routes traffic. BGP/EIGRP will take care of the routing.
But I suspect that this wasn't a simple router failure. A router failure wouldn't require other systems to be rebooted.
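To make the active/active point concrete, here is a toy sketch of spreading flows across every healthy router and shifting them to the survivor when one fails. It is not how BGP or EIGRP work internally; the router names, flow identifiers, and `pick_router` function are all made up for illustration, and real gear would make this choice in the forwarding plane from protocol-computed routes.

```python
import hashlib


def pick_router(flow_id: str, routers: dict[str, bool]) -> str:
    """Pick a healthy router for a flow by hashing the flow id.

    `routers` maps a (hypothetical) router name to a healthy flag.
    Every healthy router carries traffic, so none sits idle as a standby.
    """
    healthy = sorted(name for name, up in routers.items() if up)
    if not healthy:
        raise RuntimeError("no healthy routers available")
    digest = hashlib.sha256(flow_id.encode()).digest()
    return healthy[int.from_bytes(digest[:8], "big") % len(healthy)]


routers = {"rtr-a": True, "rtr-b": True}
flows = [f"10.0.0.{i}->192.0.2.1:443" for i in range(8)]

before = {flow: pick_router(flow, routers) for flow in flows}
routers["rtr-a"] = False  # simulate a router failure
after = {flow: pick_router(flow, routers) for flow in flows}

for flow in flows:
    print(f"{flow}: {before[flow]} -> {after[flow]}")
```

In this model a single router failure only moves flows to the remaining router; nothing downstream needs a reboot, which is the commenter's reason for doubting that a simple router failure explains the outage.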
Re: (Score:1)
the only thing public pressure does is cause the company to spend more money in redundant hardware which mostly sits unused and raises prices
My redundant hardware is constantly in use and I have nowhere near the budget of these big boys. Redundant doesn't always mean active/passive. Routers are especially easy to run active/active, hell that's the way the Internet routes traffic. BGP/EIGRP will take care of the routing.
But I suspect that this wasn't a simple router failure. A router failure wouldn't require other systems to be rebooted.
Try dealing with systems where a lot of the code was outsourced to India. A unicorn farting in Uzbekistan might cause things to get FUBAR.
Then the low-cost O&M folks you hired live in WindozeWorld where rebooting is step 1, 2, 3, 4, 5, all the way up to step 9153 in troubleshooting.
Re: (Score:2)
A router failure wouldn't require other systems to be rebooted.
Unless they are on some older OS / old mainframes / have apps that got stuck / have stuck sessions / the systems were due for an OS update and reboot.
Re: (Score:3)
i've been delayed because of weather, engine troubles, etc and i'm still alive and happy.
Many (most?) can accept that things happen, but there are limits. Yet most can't accept a lack of information and being outright lied to. Airlines in general have a very poor public perception of telling the truth about why delays are occurring. Weather and "Acts of God" are one thing, we can check the weather, many can sort out things like "Oh, there's a thunderstorm in Ohio, why is that affecting my flight in Colorado... Oh, my plane is trying to leave Ohio..."
With air lines implementing various cost
Re: (Score:1)
This - sometimes lack of information is ridiculous. I missed the connection for the 5th of my 6 flights a few weeks ago. I'm pretty sure they knew I would miss it before we took off - but they kept being optimistic. And then when I landed, no more flights anywhere that night.
Now, had I known, even 5 minutes before the flight, I would have rebooked to a flight basically anywhere other than Detroit.
Same trip - Hertz Gold had a reservation for a car at 1230 am - I get to the lot at 1am - no cars for any gold
Re: (Score:2)
But how would informing you of the issues have been better for the company, at least short-term?
Take your Hertz example; not knowing the extent of the problem, you waited around until you got a car - and Hertz got paid. Had they told you that no vehicle would be available until 2AM, you would have taken a taxi and Hertz would have been out a rental.
Of course, long-term these attitudes can cost a company customers, who will look to their competitors rather than use a company with such poor service. But that'
As transparent as their customers demand (Score:5, Interesting)
The companies understand one thing: profit.
It depends on the volume of business and a variety of factors. For example, I was recently considering the purchase of a new automobile. There was one make which I ended up removing from consideration because their infotainment was not open for me to hack on. I felt like this was important and so I told the salesman why it was important to me and that this single factor resulted in my no longer considering any models from this manufacturer.
In another instance, a specific dealership had two different sales people contact me by phone, essentially competing with each other. I didn't like that so I didn't bother calling back either one. Several days later I received a form inquiry from the general manager (certainly an automated message). I took the time to respond, explaining that I wouldn't be doing business with them because of the poor coordination of their salesmen's activities. If I already talked with one and explained what I needed in a vehicle, why was another going to call me and try to make me go through all that again?
Granted, these are different examples, but I make this small effort in the hopes that it will either improve the situation for the person who comes along after me or for myself the next time. Of course, the larger the organization, the less likely this is to have an effect. I expect that the GM of the dealership with two salesmen could possibly do something based on my feedback. I fully expect nothing to change from the manufacturer of the car with the closed infotainment system. However, if 10,000 customers all told different dealers the same thing or bothered to write to the manufacturer directly, then something might change.
Southwest and other airlines are by necessity very large companies. If you tell a booking agent something it is almost certain no manager will hear of it. But, if you contact the execs directly, perhaps if there is a VP of customer service or an ombudsman, contact that person and let them know that you value openness and that you are specifically avoiding giving them your business because of their lack of it. If they hear this from enough people, they will get the message: we are losing out on business because of our approach to blah blah blah.
So, bottom line: companies should be as transparent as their customers demand. If you, the customer, don't demand then they won't know and won't make any change.
Too glib (Score:5, Insightful)
The companies understand one thing: profit.
That's not true. Companies and the people that run them understand more than just profit. I defy you to find a single person in a company who cannot comprehend something other than profit. To claim that profit is all they can understand is absurdly untrue. But there is a nugget of truth in what you say. What is true is that companies and some (not all) of those who run them have a strong tendency to focus on profits excessively, particularly short term profits. They do this to the detriment of all else including the long term health of the company sometimes. It's too glib to say that companies only understand profit but it is fair to say that companies tend to focus on it too hard at times and make bad decisions as a result.
A well managed company has to consider things like the health of their community, the well being of their suppliers, the trust of their customers, etc. All these things sooner or later will impact profits, so if a company focuses excessively on near term profits then in the long term they will likely be worse off and so will all those who depend on the company - customers, suppliers, community, shareholders and employees.
Definitions (Score:2)
Actually, that is the definition of a company.
No it is not. The definition of a company is "an 'artificial person', invisible, intangible, created by or under law, with a discrete legal personality, perpetual succession and a common seal. It is not affected by the death, insanity or insolvency of an individual member."
"Company" is a term that refers to a variety of types of organizations [wikipedia.org]. Some types of companies are explicitly not concerned with profits at all. Perhaps you've heard of non-profit companies [wikipedia.org]? Those are a thing, you know.
From the linked
Re: (Score:3)
Investors.
Also the implied definition of profit is very limited. There are other kinds of profit than 'make as much money as possible.' But the investors are always taking on some of the risk and responsibility for a profit.
Large investors like Venture Capitalists or Mutual Funds may only be interested in how to generate money since they don't really have any other value they can derive from a random busi
Re: (Score:1)
Re: (Score:2)
Not necessarily. You *can* influence a large organization *if* they think they can make money off your idea. For example, I was at the NYC Auto Show with my GF -- and we were sitting in one of those giant Fiat-500 looking half-SUV things. And it had a glass roof which we liked.
But my GF complained that the vehicle was too tall, making it difficult for her to get snow off the roof (she prefers station wagons to SUVs, and very few manufacturers make wagons anymore).
So, we were talking with the booth rep, and
Re: (Score:2)
This brings to thought several things:
1. It's a "sun roof" so keeping it clear in the winter isn't exactly a common use case.
2. The area covered by the sun roof relative to the rest of the roof is relatively small. Putting heating wires in the glass only won't do much good, since there will still be snow on the rest of the roof. You still have to clear the rest manually, often to comply with local laws about clearing vehicles of snow before driving.
3. You don't really need to see out the top of the veh
Re: (Score:1)
Re: (Score:1)
Uhh, surely they'd just tag-team the customer with a combination of:
* increase price before bartering begins (check, built into the dealership model)
* the appearance that they're operating against each other rather
Re: (Score:1)
That's great - I agree that targeting someone who might care/have a stake in profits/has power to effect change is probably
No simple answer (Score:2)
Aside from the technical issues of what happened, what should a public-facing company's obligation be to discuss what happened in full detail?
There is no simple single answer to this question. It's going to be circumstance dependent. In many cases a lot of transparency will be helpful and appropriate. In other cases it probably won't matter much and in a few cases it might even be counterproductive though I expect that would be uncommon. If the problem is something like a security problem that will take time to resolve, immediate transparency might do more harm than good in some cases. But in general people are pretty forgiving if they under
They Solicit (Score:2)
Not a router failure and not a surprise (Score:4, Interesting)
I worked IT in the airline industry for over 20 years and that happening does not surprise me.
In many cases the systems are old, the software is not well maintained, and management does not understand how critical it is to the operation of the company. Many airline/aircraft companies have outsourced their IT to Managed Service Providers under the guise that "We are an airline, not an IT company." In doing so management negotiated the contracts, not IT, and the contracts are crap. No clauses for upgrading systems, no clauses for management of software patching, and one such contract, that I have read, guaranteed a 98% uptime. Yes, it really was 98% and not 99.999%.
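For a sense of scale on that 98% versus 99.999% gap, the arithmetic can be sketched as follows. This is a rough illustration: the function name is made up, and it assumes a 365-day year with downtime measured against the whole year rather than whatever measurement window the actual contract used.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year, assumed for simplicity


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes per year a service may be down while still meeting the guarantee."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


for pct in (98.0, 99.9, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:,.1f} minutes "
          f"({minutes / 60:,.1f} hours) of downtime per year")
```

A 98% guarantee permits roughly 175 hours of downtime a year, a bit over seven days, while "five nines" permits a little over five minutes.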
In almost all cases once IT was outsourced, they not only eliminated their IT department, they added rules that stated they could not hire IT people as it was all outsourced and they had no need of them. The companies I have worked for have hired me with odd titles to avoid such rules.
Redundancy is, in many cases, non-existent. Equipment is aging and starting to fail, and there are no plans or projects in the works to update it. Heck, one company I know of is still running on computers that were purchased in 1995.
When projects are put forward with proper HA, network fail over, SAN, etc., they get cut in cost cutting measures to the point that they are unrecognizable. A great example is an upgrade to an Oracle server that I was working on. The original upgrade plan was to deploy an HA pair with back end SAN on a dual 10g fail over connection. After it was cut it ended up being a single dual proc Windows system with internal drives running on a 1g connection. It has already crashed multiple times and each time has brought the company to a standstill.
In this day and age, companies need to realize that they run on IT. If your IT infrastructure fails, your company comes to a halt and you lose money!
Re: (Score:3)
In this day and age, companies need to realize that they run on IT. If your IT infrastructure fails, your company comes to a halt and you lose money!
It is amazing to me how many companies do not realize this until they suffer a major outage.
I like to think that it is because many senior managers are still of the generation that did not grow up with computers being a central part of their lives/businesses.
However, the generation coming up now that has had that is almost as bad but in the other direction -- they want to use computers / tablets / phones / the cloud etc. for everything and are very quick to adopt new devices /apps / services... with very li
Re: (Score:2)
If you forget to shave or shower on a particular day, should you be required to post that to your Facebook page or wear a billboard sign all day decrying your lack of hygiene?
If you don't shower, it will be apparent to everyone around you; no need for a sign or Facebook post.
Civil Engineering Lesson (Score:2)
Re: (Score:2)
What do we do when buildings and bridges fail, or when an aircraft falls out of the sky? We should do something like that. In a more enlightened age, we'd have the NTSB-equivalent for massive IT failures.
Having some minimum standards that are required for both the systems themselves and the people working on them would be great.
IT needs to get much more professional but that would mean doing battle with all the companies/lobbyists who like IT being cheap, easily outsourced (in the short term), and with a bunch of cowboys who don't want to unionize or group themselves under a true professional group in any way.
Re: (Score:2)
"IT needs to get much more professional but that would mean doing battle with all the companies/lobbyists who like IT being cheap, easily outsourced (in the short term), and with a bunch of cowboys who don't want to unionize or group themselves under a true professional group in any way."
Indeed, this is the problem. There are way too many cowboy sysadmins and coders out there who wouldn't even think about minimum standards for work product. I think the only way to solve it would be to have a purely politica
Yes, but it depends on the level of danger (Score:2)
Legal Requirements (Score:2)
Some industry sectors have legal requirements to disclose technical failures that could impact their operating bottom line. For example, think about Section 404 of the Sarbanes-Oxley Act.
Other requirements are driven by locations - for example California was the first US State to require formal disclosure if a company lost unencrypted client data.
The bottom line is that, for a growing number of industry sectors, legislative j
"transparency" to build confidence (Score:1)
I accept anything as long as it's truthful (Score:2)
Accidents happen. And only people who don't work make no mistakes. So if anyone claims he never makes mistakes, you have found the slacker.
People are surprisingly willing to cut you some slack if you admit mistakes, apologize and offer them some token compensation. Provided that they don't happen too often and that it cannot be considered malice or gross negligence.
Also, what you offer in compensation should be in sync with your mistake. Handing out a free trial that marketing has been throwing about left a
Indians, prolly. (Score:5, Interesting)
"Outsourcing partner" in Bangalore must have screwed up.
On Indian outsourcing, here's a war story. When working with Fokker, the Dutch aerospace company, I was sent to Bangalore to emit a final judgment on an outsourcing firm there. On the second day, needing to go to the toilet, I lost my way in the building. Trying to find the loo, I walked by an empty cubicle (the cubicles had large glass panes in them). On the table lay a blueprint. Being an engineer, I couldn't refrain from looking at it. The name "Areva" was printed all over it, Areva being a French constructor of nuclear power plants. It soon became clear to me that those st***d Indians had left the blueprint of an important safety valve in a current nuclear reactor design, unsupervised, on a table in an empty cubicle, and that anyone could walk in on it. I took a picture with my cell phone and sent it to Areva - after having stood there, for a test, for about 10 minutes. Nobody turned up. Anyways - some high-up security guy there went ballistic; on the phone, he thanked me and explained to me the kind of mayhem that blueprint falling in the wrong hands could have caused. (Needless to say we at Fokker immediately cut ties with that Bangalore company.)
Re: (Score:1)
On my last gig my employer had an Indian company handling Linux systems administration chores. When they created our virtual machines they left behind scripts that contained the credentials to access another customer's infrastructure. The other customer was a big, national bank.
Most big companies are willing to accept the increased risk incurred by farming out work to incompetent offshore people if the price is right.
That's how they do.
Case-by-case basis (Score:1)
Those customers that got "badly burned" are going to want to know that you've learned your lesson.
If the event hit the press or word got around to your target customer base, you'll need to convince them that it won't happen again (I'm looking at you, Southwest Airlines).
If your industry is one where the failure could cause death or injury if it happened again - even to a competitor - then you have a moral and possibly legal obligation to "go public" within your industry so they can learn from your experienc
All Backups Failed? (Score:1)
Re: (Score:2)
They prolly outsourced. See my "Indians/Bangalore" post above.
Skeletons in the closet? (Score:2)
All large organizations have some messy aspects of their internal IT. The longer the organization has existed, and the larger and more diverse it is, the worse it gets. There was a story a couple days ago circulating about a Citibank employee (NOC engineer or something like it) that was able to stop most network traffic by removing the configs in a few key routers. (Turns out he was upset about a bad review he had just been given.) If a network were properly designed with no choke points, no SPOFs, etc. it
Re: (Score:2)
Although I rarely respond to ACs, here is a thumbs-up. This sounds like the cleverest in-road I personally would consider pursuing in case I were hired to do an investigation.
Be Transparent (Score:2)