US FAA Adopts New Safeguards After Computer Outage Halted Flights (reuters.com) 25
The Federal Aviation Administration (FAA) told lawmakers Monday it had made a series of changes to prevent a repeat of a key computer system outage that forced a nationwide Jan. 11 ground stop disrupting more than 11,000 flights. From a report: The FAA said it has implemented "a one-hour synchronization delay for one of the backup databases. This action will prevent data errors from immediately reaching that backup database." The FAA also said it "now requires at least two individuals to be present during the maintenance of the (messaging) system, including one federal manager."
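The "one-hour synchronization delay" the FAA describes is essentially a delayed-apply replica: the backup buffers incoming changes and only applies them after a fixed delay, giving operators a window to catch a bad write before it corrupts the standby copy. A minimal sketch of that idea (illustrative only, not the FAA's actual design; all names are hypothetical):

```python
import time
from collections import deque

class DelayedReplica:
    """Illustrative sketch of a delayed-apply backup: changes are
    buffered and applied only after `delay_seconds`, so a bad change
    spotted on the primary can be discarded before it reaches the copy."""

    def __init__(self, delay_seconds=3600, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock          # injectable clock, for testing
        self.pending = deque()      # (apply_at, key, value)
        self.data = {}

    def receive(self, key, value):
        # Buffer the change instead of applying it immediately.
        self.pending.append((self.clock() + self.delay, key, value))

    def discard_pending(self):
        # Operators spotted a bad change on the primary: drop the
        # not-yet-applied buffer instead of replicating the corruption.
        self.pending.clear()

    def tick(self):
        # Apply every buffered change whose delay has elapsed.
        now = self.clock()
        while self.pending and self.pending[0][0] <= now:
            _, key, value = self.pending.popleft()
            self.data[key] = value
```

Real databases offer this natively (e.g. delayed standby replicas); the trade-off is that the backup is always an hour stale, so failover loses up to an hour of changes.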
Turn keys... (Score:2, Funny)
Turn keys on my mark.
MARK - system goes down....
Re: Turn keys... (Score:2)
Cmon guys mod up
So things now break one hour later? (Score:2)
Yeah, great fix. And I have seen "managers" making sure things were done right. Most of the time they were not even looking; they were on their phones. One even told me directly that he had no clue how things worked on the tech side.
Re: (Score:2)
Re: (Score:2)
You think? I do not. When something goes this badly wrong, the rot sits deep on all levels.
Re: (Score:2)
Yeah, I would think having a peer check your work as you do the deployment would be way more effective at preventing errors.
Re: (Score:2)
Not only you. Anybody who does this competently uses two _experts_ for anything like this, because otherwise it is pointless.
Re: (Score:2)
And almost any major upgrade must have a "rollback" plan in place before it is approved. The Change Request checklist at our location included a rollback requirement where we had to explain the plan to reverse any changes we made, the procedure to do it, who could do it, and the timeline to accomplish it.
If the risk assessment said the rollback could be completed in two hours by saving the data before it was upgraded and restoring it if a problem occurred, it was probably approved. If the procedure was to back
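The snapshot-before-change, restore-on-failure pattern described above could be sketched like this (a minimal sketch; `apply_change` and `verify` are hypothetical placeholders for site-specific steps):

```python
import shutil
import tempfile
from pathlib import Path

def upgrade_with_rollback(target: Path, apply_change, verify):
    """Snapshot a file before a change, apply the change, verify it,
    and restore the snapshot if anything goes wrong."""
    backup = Path(tempfile.mkdtemp()) / target.name
    shutil.copy2(target, backup)        # save known-good state first
    try:
        apply_change(target)
        if not verify(target):
            raise RuntimeError("post-change verification failed")
    except Exception:
        shutil.copy2(backup, target)    # rollback to the snapshot
        raise
```

The point of writing the plan down before approval is exactly the `verify` and restore steps: someone has to state in advance how you know the change worked and how long the restore takes.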
Re: (Score:2)
In an environment like this? Most definitely.
what about fixing input to cut down on data errors (Score:2)
what about fixing input to cut down on data errors?
Was it a maintenance error? Or was it a data entry error that crashed the system?
Re: (Score:2, Interesting)
It was a maintenance error. They inadvertently synchronized a bad parameter to the backup environment before pushing a code update. Ergo, when the system 'restarted' after the code update it imploded - and the hot-spare backup environment was already corrupt.
I imagine restoration was - rollback to known good point, move the transactions forward (skipping the errant one), and restarting things... hence the rather lengthy outage window.
Also... the system is rather old. Adding to the issues.
How do you take
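The recovery described above (restore a known-good point, then roll the transaction log forward while skipping the errant entry) could be sketched as follows. This is a guess at the general technique, not the FAA's actual procedure; all names are illustrative:

```python
def restore_and_replay(snapshot, log, bad_txn_ids):
    """Start from a known-good snapshot of the database state, then
    replay the transaction log forward, skipping the transactions
    identified as corrupt."""
    state = dict(snapshot)              # known-good restore point
    for txn in log:
        if txn["id"] in bad_txn_ids:
            continue                    # skip the errant transaction
        state[txn["key"]] = txn["value"]
    return state
```

Identifying *which* transaction was errant, and validating the result afterward, is usually what makes the outage window long, not the replay itself.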
Oldness has nothing to do with it (Score:2, Interesting)
This system would've been outperformed by a fleet of teletypes producing hardcopy. Central goes down, you still have the hardcopies and you just hope that nothing really vital needs to go out as a NOTAM in the few hours you need to crank the thing back on. If it does, well, you still have radio. And you'll have a much smaller and more localised problem.
There used to be a German weather subscription service for (general) aviation that used ISDN and built a Fido-style network on top of that. Should the centr
Highly Available (Score:2)
Even modern systems are Highly Available, not "Always Available" - nothing is 100%, but there are procedures and designs that protect and minimize the impact of mistakes. Change Requests and "manager over the shoulder" aren't really going to improve anything except creating paperwork and job angst.
I'm not generically a fan of remaking working systems, but there does come a time when the requirements or outcomes have shifted far enough that throwing a system away and restarting is the better option. In my
Re: (Score:1)
I'm not generically a fan of remaking working systems
Yep, that's the reason COBOL is still in use.
Re: (Score:2)
Change Requests and "manager over the shoulder" aren't really going to improve anything except creating paperwork and job angst
I would say this is just how it goes in US Government jobs, but at the same time I've seen what happens when someone tries to explain what you just said to a Congressional sub-committee. The folks in Congress are the ones that make the "eyes over your shoulder" happen way more often than not. Tell you the truth, I think it's just projection.
It's time to make ATC private, similar to Canada (Score:2)
https://transportationtodaynew... [transporta...aynews.com]
Re: (Score:2)
Privatization destroyed the UK's railways, gutted France's EDF, and took Prague's (CZ) water utilities to the brink of collapse...
Privatization can go both ways.
Re: (Score:2)
System is not "mission critical" (Score:2)
Re: System is not "mission critical" (Score:2)
You make a good point, but NOTAMs still have their place. Flights between two uncontrolled airports come to mind, although of course if you're paranoid like me you'll call your destination beforehand in that case. The wider questions are, of course: 1. How can you fuck up something so simple? and 2. How easy and cheap would it be to replace the current system with a web-based simple text solution?
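To the commenter's second question: a bare-bones text lookup is genuinely tiny. A back-of-the-envelope sketch (toy in-memory store, hypothetical data; a real replacement would need authentication, persistence, distribution, and auditing, which is where the actual cost lives):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy in-memory store keyed by airport path; contents are made up.
NOTAMS = {"/KJFK": "RWY 04L/22R CLSD", "/KBOS": "TWY A LGTS U/S"}

class NotamHandler(BaseHTTPRequestHandler):
    """Serves NOTAMs as plain text: GET /KJFK returns the stored string."""

    def do_GET(self):
        body = NOTAMS.get(self.path)
        if body is None:
            self.send_response(404)
            self.end_headers()
            return
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet
```

The hard part of the real system was never serving text; it's the feeds, formats, and legacy consumers hanging off it.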
Upgrading equipment. (Score:2)
8-inch floppy drives to 5.25in.
They revoked access (Score:2)
As they should have done long ago, and as should every dev shop that deploys software critical to its company's operation. It's way too common for developers to have free rein on production systems. If the *only* way to make changes to a production system is to create a deployment and test that deployment on a test system first, these kinds of issues will happen far less frequently.
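The test-first gating policy described above can be expressed very simply: production refuses any artifact version that has not already passed in the test environment. A minimal sketch (names are hypothetical; real pipelines enforce this in CI/CD tooling rather than application code):

```python
class DeployGate:
    """Refuses a production deploy unless the same artifact version
    was already deployed and verified in the test environment."""

    def __init__(self):
        self.verified_in_test = set()

    def record_test_pass(self, version):
        # Called by the pipeline after the test deployment succeeds.
        self.verified_in_test.add(version)

    def deploy_to_prod(self, version):
        if version not in self.verified_in_test:
            raise PermissionError(f"{version} has not passed in test")
        return f"deployed {version} to production"
```

The value is that the gate is mechanical: nobody, developer or manager, can push an unverified build by being in a hurry.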