A Corrupt File Led To the FAA Ground Stoppage (cnn.com) 176
According to CNN, the Federal Aviation Administration system outage on Wednesday has been traced to a corrupt file. From the report: In a statement late Wednesday, the FAA said it was continuing to investigate the outage and "take all needed steps to prevent this kind of disruption from happening again." "Our preliminary work has traced the outage to a damaged database file. At this time, there is no evidence of a cyberattack," the FAA said. The FAA is still trying to determine whether any one person or "routine entry" into the database is responsible for the corrupted file, a government official familiar with the investigation into the NOTAM system outage told CNN.
When air traffic control officials realized they had a computer issue late Tuesday, they came up with a plan, the source said, to reboot the system when it would least disrupt air travel, early on Wednesday morning. But ultimately that plan and the outage led to massive flight delays and an unprecedented order to stop all aircraft departures nationwide. The computer system that failed was the central database for all NOTAMs (Notice to Air Missions) nationwide. Those notices advise pilots of issues along their route and at their destination. It has a backup, which officials switched to when problems with the main system emerged, according to the source. FAA officials told reporters early Wednesday that the issues developed in the 3 p.m. ET hour on Tuesday.
Officials ultimately found a corrupt file in the main NOTAM system, the source told CNN. A corrupt file was also found in the backup system. In the overnight hours of Tuesday into Wednesday, FAA officials decided to shut down and reboot the main NOTAM system -- a significant decision, because the reboot can take about 90 minutes, according to the source. They decided to perform the reboot early Wednesday, before air traffic began flying on the East Coast, to minimize disruption to flights. "They thought they'd be ahead of the rush," the source said. During this early morning process, the FAA told reporters that the system was "beginning to come back online," but said it would take time to resolve. The system, according to the source, "did come back up, but it wasn't completely pushing out the pertinent information that it needed for safe flight, and it appeared that it was taking longer to do that." That's when the FAA issued a nationwide ground stop at around 7:30 a.m. ET, halting all domestic departures. The source said the NOTAM system is an example of aging infrastructure due for an overhaul. "Because of budgetary concerns and flexibility of budget, this tech refresh has been pushed off," the source said. "I assume now they're going to actually find money to do it."
So, don't reboot it in the dead of night? (Score:3)
3:30 AM EST sounds like the nadir of passenger air traffic over the U.S.
That means it should be finished rebooting by around 5:00 AM EST. That gives an hour before morning passenger traffic begins to verify everything is working ok.
Their timing makes no sense.
Re: (Score:2)
And you know so much about their internal systems how?
Re: (Score:2)
And you know so much about their internal systems how?
I'm just taking the FAA's estimate about how long it would take to Restart and Verify proper operation of the system (about 1.5 hours).
The rest is just being able to tell time.
Re: (Score:2)
Their timing makes no sense.
Your post makes no sense. They found they had a problem late Tuesday and tried to switch to the backup system, which also had a corrupt file. They decided to reboot the entire system in the early morning hours of Wednesday to minimize the impact of the reboot.
Are you suggesting there was a better time, perhaps in the middle of the day when more flights are in the air?
So, define "Early Morning Hours".
I was watching TV news starting at about 5 AM Wednesday Morning, and there wasn't anything announced about a Reboot until after 7 AM EST.
So, when exactly was this decision to Reboot actually made?
Re:a reboot can take about 90 minutes! (Score:5, Insightful)
A reboot can take about 90 minutes! My God!
If you ever work on government IT, you will be astonished at how crufty some of the hardware is, and the software is even worse. The FAA has long had a reputation as one of the worst paleoware offenders.
This is how modern digital infrastructure works [xkcd.com] and government infrastructure is even worse.
As always with the government, nothing is fixed until a crisis happens and there is someone to blame. At least no one died this time.
Re:a reboot can take about 90 minutes! (Score:5, Insightful)
"As always with the government, nothing is fixed until a crisis happens and there is someone to blame. "
That's not a problem caused by the agencies, that's a problem caused by the way Congress funds them. After a crisis happens, blame must be ferreted out to see what the issues truly are so that the agencies can go to Congress and plead for funds to fix the problem. Then they have to endure legions of R's and D's expressing how they themselves have immediately become experts in the systems in question and question why the funds are needed. That's how the IRS has devolved into a backward technology outfit.
But rest comfortably; the R's in the House have a bill to abolish the IRS and replace it with a sales tax so that the little people can get on with their job of supporting the rich people. Incidentally, corporations wouldn't pay any tax either, which will warm the dark cold hearts of the R's on the Supreme Court...corporations are people, you know.
However, to get their bill passed, they'd have to repeal the 16th Amendment:
"The Congress shall have power to lay and collect taxes on incomes, from whatever source derived, without apportionment among the several States, and without regard to any census or enumeration."
So maybe it won't be happening this year.
Re:a reboot can take about 90 minutes! (Score:5, Insightful)
The administration had two years where their party controlled Congress and gave them whatever they wanted. So, what you're saying is that Buttigieg either didn't bother to ask for the money or pissed it away on something else.
Re:a reboot can take about 90 minutes! (Score:4, Funny)
making sure the highways aren't racist (as if asphalt could be)
Then why is it called "blacktop"?
Re: (Score:3, Insightful)
both senate and the house were nearly deadlocked and any majority was not enough to break the 60 thresh. add in the fact that one party exists only to taunt the others (ALL others), and destroy what we currently have - its no wonder that not as much got done as the progressives wanted.
finally, there were two that were in the pockets of the extreme right and that mucked up the works, too.
there was no true majority. not with the filthybuster still in place. as long as the country is divided and the FB is i
Re: (Score:3)
This is why limiting the control of government over your stuff is important. Europe has done more to privatize these kinds of "Government" function than we have in the US.
Re:a reboot can take about 90 minutes! (Score:4, Insightful)
America privatizes healthcare and socializes package delivery.
Much of Europe does the reverse.
Postal service privatization [wikipedia.org]
Re: (Score:3)
There is no need to repeal the 16th Amendment: it gives Congress the power to collect income taxes if it wants to, it does not require Congress to collect income taxes.
If you haven't downloaded the new 1040 yet you should. The line for wage income just mutated into line 1a to 1i and is totaled on line 1z.
Schedules 1, 2, and 3 still live too.
Re: (Score:3)
Sales tax is not an income tax. Sure they could repeal the income tax, but they couldn't replace it with a national sales tax/VAT under the current constitution.
Re: a reboot can take about 90 minutes! (Score:4, Interesting)
Sounds exactly like big businesses. You would be shocked at the ancient software running all our banks
Re: a reboot can take about 90 minutes! (Score:2)
Re: (Score:3)
Sounds exactly like big businesses. You would be shocked at the ancient software running all our banks
No I wouldn't. There are decades-old mainframes running in private businesses because they're rock solid, and businesses like rock solid. The difference is that businesses generally pay contractors to maintain those systems, and if someone doesn't do their job, they get fired.
Do you think anyone is going to get fired at FAA or any other federal agency? Don't answer, it was a rhetorical question. They'll get re-assigned, at worst, with no loss in pay or seniority.
I've worked in government, and you'd be stun
Re:a reboot can take about 90 minutes! (Score:5, Insightful)
"As always with the government, nothing is fixed until a crisis happens and there is someone to blame. At least no one died this time."
Plenty of corporations operate that way, for example, Southwest Airlines
Re: a reboot can take about 90 minutes! (Score:3, Insightful)
Re: (Score:2)
Re: (Score:3)
I've seen government 'servers' shipped carefully, insured, and jealously tracked that couldn't have even fetched $30 on ebay (but cost more than that to ship).
Re: (Score:3)
If you ever work on government IT
I have
As always with the government, nothing is fixed until a crisis happens and there is someone to blame
No, it only gets fixed when the legislature feels there's no one else to blame. Literally as everything was unfolding members of the US House took to Twitter playing the blame game. That's how you know this isn't getting fixed. Until we take these events and indicate to members of the legislature responsible they are on the hook for how old they've allowed these systems to get, nothing will ever be fixed. As long as we continue to play the blame game with them, we are as equally responsible for th
Re: (Score:3)
What we really need is for there to be a giant tech lobby that's interested in upgrading government systems, for a hefty profit, of course. I mean, ultimately, it would still be nickel and dimed to utter stupidity over making sure the right people made enough money off of it, but at least we'd bring government systems into the current century, slowly. Rather than just waiting for them all to collapse of themselves completely.
Our legislators listen to one thing and one thing only: money. Money from lobbyists
Re: (Score:2)
What we really need is for there to be a giant tech lobby that's interested in upgrading government systems, for a hefty profit, of course.
Do you really think that Big Tech lacks lobbyists?
Re: (Score:3)
If you ever work on government IT, you will be astonished at how crufty some of the hardware is, and the software is even worse. The FAA has long had a reputation as one of the worst paleoware offenders.
I can testify to this. My first job after graduating college with a computer science degree was to be a civilian computer programmer at a US military base. I'm not going to name the branch of the service as I actually do respect them, but it was crazy how old some of our systems were. When I was in college, we talked about punch cards as being ancient tech. Where I worked we actually had one system that still used punch cards. They did finally replace it with something more modern, but it was shoc
Re: (Score:3)
Worse, even when knowing the technology intimately, you won’t be able to tell which ones tr
Robert'); DROP TABLE students;-- (Score:5, Funny)
The FAA is still trying to determine whether any one person or "routine entry" into the database is responsible for the corrupted file.
I see that Little Bobby Tables is at it again.
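For anyone who missed the reference: the joke only works against code that pastes user input straight into a SQL string. A minimal sketch of the usual defence, using Python's built-in sqlite3 module; the `students` table and the `add_student` helper are made up for illustration and have nothing to do with whatever the FAA actually runs.

```python
import sqlite3

def add_student(conn, name):
    # The ? placeholder makes the driver treat `name` purely as data,
    # so quotes and semicolons in it are never parsed as SQL.
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

add_student(conn, "Robert'); DROP TABLE students;--")

# The table still exists and holds the odd name verbatim.
rows = conn.execute("SELECT name FROM students").fetchall()
```

With the placeholder, the whole string ends up stored as a name and the table survives.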
Central Point of... (Score:5, Insightful)
oracle DB? on windows server? (Score:3)
oracle DB? on windows server?
Re:oracle DB? on windows server? (Score:5, Funny)
Microsoft Access
Re:oracle DB? on windows server? (Score:5, Funny)
COBOL has entered the chat.
Re: (Score:2)
Re: (Score:2)
It's also a bit ironic given the shit handed out to SouthWest just a week or so previously, and which is still in the news. Not that it makes up for SW's screwup, but at least their mistake didn't bring the entire system crashing down - only their own. And wasn't the FAA even talking about investigating the incident, decrying how this was unacceptable? Pot, kettle, black.
Obligatory XKCD (you know the one), except substitute the FAA's creaking software infrastructure for the Nebraskan's project as that sm
Re: (Score:2)
Re: (Score:3)
Not sure where you're getting that 700 number, but it seems to be way off. Maybe a preliminary figure? Recent numbers are here:
Approximately 10,103 flights within, into or out of the United States have been delayed and 1,343 have been canceled as of 9:20 p.m. ET Wednesday, according to the flight-tracking website FlightAware.
https://www.cnn.com/us/live-ne... [cnn.com]
Re: (Score:2)
It's also a bit ironic given the shit handed out to SouthWest just a week or so previously
It's not a contest. We shouldn't tolerate incompetence from airlines just because the FAA is worse.
If we make the government the "gold standard" for competence, we are doomed.
Re: (Score:3)
What a pointless remark. Organizations have technical failures, it happens. What do you expect us to have, several competing organizations doing what the FAA does? Or maybe you'd like to adequately fund the FAA to have multiple redundant systems. The R's will take away your magic Ayn Rand decoder ring for that one.
Re: (Score:2)
What does he expect? Probably that if an organization is going to criticize one airline for having an outdated IT system that cancels and delays a boatload of flights, it could ... not make all airlines rely on its own outdated IT system that prevents flights from taking off. The FAA doesn't need to have multiple systems to do this, they just need to have one reasonably redundant and resilient system. They're an aviation agency, not a dot-com, so they don't need to tolerate the kind of organizational fai
Re: (Score:3)
> "move fast and break things" attitude.
I'm pretty sure this kind of attitude wasn't involved in any way...
Re: (Score:2)
Re: (Score:2)
Really, the entire industry should sue the FAA and DoT for gross negligence in their failure to make sure the systems they require airlines to depend upon actually work. They lost a ton of business and I suspect will be forced to compensate customers for the government's screw-up.
Re:Central Point of... (Score:5, Insightful)
since this seems to be the only serious thread:
Interesting that the corrupt file was in the backup system too. I'm always dubious about "live mirror" data architecture being the only backup, as it only protects against hardware failure, not bad operations/deletions/whatever. Better to do RAID6 or better on the primary, and have a "warm", not "hot" alternate. Bonus if you can select how old your data is (1 day, 1 hour...)
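The "select how old your data is" idea can be sketched in a few lines. This is only an illustration under assumed names: `pick_snapshot` and the list of snapshot timestamps are hypothetical, and a real warm standby would also need to replay or rebuild whatever changed after the chosen point.

```python
from datetime import datetime, timedelta

def pick_snapshot(snapshots, now, min_age):
    """Return the newest snapshot that is at least `min_age` old, or None."""
    cutoff = now - min_age
    eligible = [ts for ts in snapshots if ts <= cutoff]
    return max(eligible) if eligible else None

# Hypothetical snapshot history: taken 1, 6, 24 and 48 hours ago.
now = datetime(2023, 1, 11, 7, 30)
snapshots = [now - timedelta(hours=h) for h in (1, 6, 24, 48)]

# Corruption is suspected within the last 12 hours, so skip anything newer.
chosen = pick_snapshot(snapshots, now, timedelta(hours=12))
```

Here `chosen` is the 24-hour-old snapshot, the newest one that predates the suspected window.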
Also interesting that the reboot takes 90 minutes. I've seen stuff like this before, where between slow computer, maximum auditing enabled, required-to-run applications, and bloated apps, a reboot takes along this timescale (might have been under an hour), but for a single system of critical importance... Yikes. Sounds like the lowest bidder needs to be told to put a faster computer under it, and maybe parallelize some things. (even if it's some old AS/400 or something, there's upgrade paths for anything, even if you need an emulator)
Re:Central Point of... (Score:5, Informative)
You make good points. I also just want to toss out that a system as critical as this ought to have two or even three levels of failover. The first is the "live mirror", to handle hardware problems, power failures, etc. The second level should be an offline system that automatically takes over with older, "known good" data - i.e., a snapshot taken from a time when the system was known to be running correctly (likely 24 hours). You could easily argue for a third level that uses no live data at all, but simply provides absolute minimal functionality.
I also like the approach taken by a certain critical organisation I am acquainted with: Upper management has access to a switch that disables the primary systems. They flip that switch without notice, at least once a year, just to see that the failover systems actually do come online seamlessly. Obviously, that doesn't catch everything (might not have caught this data corruption problem), but it does ensure that the failover systems are taken seriously.
The 90 minute reboot only becomes a problem, if the failover doesn't work. As it didn't in this case.
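A rough sketch of the tiered failover described above, assuming each tier can be modelled as a function that returns a list of notices. Every name here (`first_healthy`, `is_valid`, the four loader functions, the record fields) is hypothetical; the point is only that each tier's data gets validated before it is trusted, so a mirror carrying the same corruption is skipped rather than served.

```python
def is_valid(notices):
    # Assumed sanity check: at least one notice, and every notice has text.
    return bool(notices) and all(
        isinstance(n, dict) and n.get("text") for n in notices
    )

def first_healthy(tiers):
    """Try each (name, loader) tier in order; return the first valid result."""
    for name, loader in tiers:
        try:
            data = loader()
        except Exception:
            continue  # tier unavailable, fall through to the next one
        if is_valid(data):
            return name, data
    raise RuntimeError("all tiers failed")

# Hypothetical tiers for illustration.
def primary():
    raise ConnectionError("primary down")

def mirror():
    return [{"id": 1, "text": None}]  # replicated the same bad record

def snapshot():
    return [{"id": 1, "text": "RWY 09 CLSD"}]  # older, known-good data

def minimal():
    return [{"id": 0, "text": "NOTAM service degraded"}]

tier, data = first_healthy([
    ("primary", primary),
    ("mirror", mirror),
    ("snapshot", snapshot),
    ("minimal", minimal),
])
```

In this run the primary is down, the mirror fails validation, and the snapshot tier is the one that ends up serving.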
Re: (Score:3)
On Linux, you can use rsnapshot to maintain an arbitrarily long history of iterative backups that doesn't take up all that much space. Unchanged files are hard-linked, which takes up little space other than the initial version save; changed files get uniquely saved, so you can review/access a file's versions over the history of your saved backups.
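The hard-link trick is easy to show in miniature. The sketch below is not rsnapshot's implementation, just the same idea in plain Python; the `snapshot` function and the directory names are made up for illustration, and it assumes a filesystem that supports hard links.

```python
import filecmp
import os
import shutil
import tempfile

def snapshot(src, prev, dest):
    """Copy `src` into `dest`, hard-linking files unchanged since `prev`."""
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        os.makedirs(os.path.normpath(os.path.join(dest, rel)), exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.normpath(os.path.join(dest, rel, name))
            p = os.path.normpath(os.path.join(prev, rel, name)) if prev else None
            if p and os.path.isfile(p) and filecmp.cmp(s, p, shallow=False):
                os.link(p, d)        # unchanged: share the existing inode
            else:
                shutil.copy2(s, d)   # new or changed: store a fresh copy

# Demo in a throwaway directory.
base = tempfile.mkdtemp()
src = os.path.join(base, "src")
snap1 = os.path.join(base, "snap.1")
snap2 = os.path.join(base, "snap.2")
os.makedirs(src)
with open(os.path.join(src, "a.txt"), "w") as f:
    f.write("same")
with open(os.path.join(src, "b.txt"), "w") as f:
    f.write("v1")

snapshot(src, None, snap1)           # first run: everything is copied
with open(os.path.join(src, "b.txt"), "w") as f:
    f.write("v2")
snapshot(src, snap1, snap2)          # second run: only b.txt is copied
```

After the second run, `a.txt` is one file on disk reachable from both snapshots, while each snapshot keeps its own version of `b.txt`.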
Re: (Score:2)
On Linux, you can use rsnapshot to maintain an arbitrarily long history of iterative backups that doesn't take up all that much space. Unchanged files are hard-linked, which takes up little space other than the initial version save; changed files get uniquely saved, so you can review/access a file's versions over the history of your saved backups.
Today on Linux you'd just use zfs and do actual snapshots, you can just cd into them and look around.
Re: (Score:3)
Explain to me how you are going to recover from a file system snapshot if the file system itself is corrupted? You can't so you need to do other sorts of backup. rsnapshot is frankly a bit crap in this regard, though it is better than a poke in the eye. There are much better commercial solutions that I would expect to be used in this scenario.
Re: Central Point of... (Score:2)
Re: (Score:2)
Re: (Score:3)
This wasn't a hardware issue, it was a data issue.
The root problem was data. But that was exacerbated by slow hardware. There is no excuse for a mission-critical system to take 90 minutes to reboot.
hard restarts are not so easy (Score:5, Interesting)
The "live backup" with the corrupt file was not there to serve as a backup for corrupt data, it was there to serve as a failover for failed HARDWARE. And to eliminate the need to wait 90 minutes for a reboot after swapping a card or something.
While they haven't gotten specific about the file, it was probably a database table with a nonsense entry in it that broke a query or two. And that table was probably containing "live data", so you can't just restore yesterday's backup. The table had to be rebuilt from live data gathered from other sources. This takes time, and the system can't be running while it's being rebuilt.
As for startup time, it's very likely running in a VM, and those can take time to spin up, especially when there are other things like SQL servers and data sources that also have to be brought back up. If you have to pull the plug on a VM server, it can easily take HOURS for large VMs and SQL tables to run their error checks and self-repairs. It "only" took 90 minutes in this case because it was probably running on a dedicated system that only had to bring up a few things from a hard shutdown. They probably had the bad table rebuilt before the VM was fully back up and waiting for it.
Of course a lot of this can only be speculated about without more information, but you get the idea. This wasn't merely a "my router's not working, unplug it and plug it back in so I can get back to work" kind of fix.
We had a VM server go down hard at work last year. It was running quite a few things, including our main SQL server. Took 45 minutes to get the VM itself fully back up, and several hours to restore the main systems. There were minor systems that weren't back up for several days due to the time required to check and repair everything. We thought we had everything in place to ensure a clean, orderly shutdown of our SQL server and VM, but that's not the easiest thing to test fully, and we got to learn a few new lessons that day. (We lost commercial power, it switched instantly to battery UPS, the genny kicked on, and we thought everything was fine, but we weren't alerted that the transfer switch had failed, and by the time the low-battery alarm on the UPS went off we didn't have enough time for a clean shutdown.)
Re: (Score:2)
"The "live backup" with the corrupt file was not there to serve as a backup for corrupt data, it was there to serve as a failover for failed HARDWARE."
Obviously. So the question is: where was their backup for corrupt data?
Re: (Score:2)
"And that table was probably containing "live data", so you can't just restore yesterday's backup. "
Yesterday's data isn't too relevant for today's flight conditions. The only reason to have a backup is for historical reasons.
Re: (Score:2)
Really? We had an unexpected shutdown of an ESXi server at work last year and all the VMs were restarted on other ESXi servers in the cluster and nobody noticed a damn thing. I was panicking like mad for a couple of minutes until I realized the incident had gone unnoticed. For a silver lining, we got a real world live test that our HA actually worked as designed.
Basically, if you can't just yank the power on your VM server and everything just keep trucking in very short order then your system is badly designe
Re: (Score:3)
why did I call my airport drop_all_tables? (Score:4, Funny)
why did I call my airport drop_all_tables?
Budgetary Issues (Score:2)
There is never enough funds until the shit hit the fan and then all of a sudden there is a budget again. You just never know how many quarters you can find when you dig deep in between the couch cushions.
Re:Budgetary Issues (Score:5, Insightful)
There is never enough funds until the shit hit the fan and then all of a sudden there is a budget again. You just never know how many quarters you can find when you dig deep in between the couch cushions.
“Never let a good crisis go to waste”
Re:Budgetary Issues (Score:5, Insightful)
There is never enough funds until the shit hit the fan
Not true. Congress gave the FAA billions for the NextGen Air Transport System [wikipedia.org], and the FAA squandered the money on a massive project that was started in 2007 and isn't expected to be ready until 2030, after 23 years of development. Few expect it to be completed even by that late date, if ever.
The ObamaCare rollout showed that with government IT projects, success is correlated with LOWER funding. Oregon spent more than any other state on ObamaCare, over $300 million, hired Oracle for the implementation, and their rollout was a disaster. Kentucky spent $3M (1% as much), and their site worked flawlessly on day one.
The secret to Kentucky's success was simple:
The problem with the FAA isn't "not enough funding," but the opposite, too much funding, leading to an overly ambitious project that has little chance of success.
Re:Budgetary Issues (Score:4, Informative)
Define "corrupt" (Score:5, Insightful)
Entering something incorrectly into a database using some standard system the FAA would use to do so isn't "corrupt". Downloading or transferring data or a file, and having that transfer interrupted, resulting in the loss of or alteration of data, is "corrupt" in the technical sense, where I come from.
Sometimes we are far far far removed from the description given from the IT folks involved to their superiors; I'm sure the issue would be pretty clear to us in technical terms, but a public-facing answer such as "a file got corrupted" is fucking nonsense.
Re: (Score:2)
Entering something incorrectly into a database using some standard system the FAA would use to do so isn't "corrupt". Downloading or transferring data or a file, and having that transfer interrupted, resulting in the loss of or alteration of data, is "corrupt" in the technical sense, where I come from.
Sometimes we are far far far removed from the description given from the IT folks involved to their superiors; I'm sure the issue would be pretty clear to us in technical terms, but a public-facing answer such as "a file got corrupted" is fucking nonsense.
Agree.
This sounds more like a copy/paste from a Word or Excel doc into a database.
So flimsy, fragile crap then? (Score:2)
To make good on that promise of preventing this from happening again, they probably would need to throw everything away and redesign and re-implement using actually competent people and high-availability paradigms. Who thinks that will happen? Yeah, me not either.
Re: (Score:2)
throw everything away and redesign and re-implement
No. No. No. This is exactly the wrong thing to do, and it is WHAT THEY DID.
The "redesign" was started in 2007 and is scheduled (haha) to be completed by 2030: Next Generation Air Transportation System [wikipedia.org].
The system that failed is the old system that no one works on anymore, so they can focus on the "redesign".
Things you should never do: Rewrite software from scratch [joelonsoftware.com]
Re: (Score:2)
"Things you should never do: Rewrite software from scratch [joelonsoftware.com]"
Elon Musk disagrees, often
Re: (Score:2)
Elon Musk disagrees, often
Citation?
Re: (Score:2)
It really depends. But if they think they can design a system in 23 years, they are bound for failure. That time is far too long.
The problem with these projects and why they all fail is that nobody working on it actually wants to make it work. They all just want to get rich. If you redesign and re-implement with the right people and with duct-tape over "management's" mouths so they cannot interfere, this can work and usually does. Problem is it is almost never done this way.
Re:So flimsy, fragile crap then? (Score:5, Interesting)
Things you should never do: Rewrite software from scratch
Never trust any advice that deals in absolutes. Where I work, we had an old mission-critical system written in a language that only one active programmer knew, and he was nearing retirement, and that ran on a system that had been obsolete for years.
I took on the task of rewriting it from scratch using modern technology. It took a few years to get it to production, but it was well worth it. He retired a couple years before then, so the one person we had left who understood how to use the old system supported it until I was done with the rewrite enough for it to be usable.
I replicated (and improved) all the features of the old system, and added a whole slew of new features that the users had only dreamed of for the decades the old system was in place.
Sometimes rewriting old systems from scratch is the only option.
What's a Backup for again? (Score:2)
"did come back up, but it wasn't completely pushing out the pertinent information that it needed for safe flight, and it appeared that it was taking longer to do that."
TFS was a decent nutshell of what happened, but one glaring question remains; why brag of having a backup system when it sounds like no one actually used it to test out the reboot theory, and perhaps prepare for this unforeseen delay?
Tends to explain why no one is giving them a budget. Do they know what to do with it?
Re: (Score:2)
Re: (Score:3)
In all fairness, I'm not sure how one would test this sort of backup system, considering that commercial aviation under normal circumstances is a 24/7/365.25 business.
I guess maybe bring the backup system up, but don't connect it to whatever mechanisms actually relay the information to the pilots or whomever else needs it? Then make sure the outputs of the live and backup systems match 100%?
Of course that kind of testability would need to be architected in, possibly after the fact.
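The "make sure the outputs match" check could look something like the sketch below, assuming both systems can dump their current notices as an id-to-text mapping. The `diff_outputs` function and the sample notices are hypothetical; a real comparison would also have to tolerate updates that arrive between the two dumps.

```python
def diff_outputs(live, shadow):
    """Compare two {notam_id: text} mappings and report every mismatch."""
    problems = []
    for key in sorted(live.keys() | shadow.keys()):
        if key not in shadow:
            problems.append((key, "missing from backup"))
        elif key not in live:
            problems.append((key, "missing from live"))
        elif live[key] != shadow[key]:
            problems.append((key, "content differs"))
    return problems

# Hypothetical dumps from the live system and the disconnected backup.
live = {"A1": "RWY 09 CLSD", "A2": "TWY B CLSD"}
shadow = {"A1": "RWY 09 CLSD", "A3": "ILS U/S"}

report = diff_outputs(live, shadow)
```

An empty report means the two systems agree; anything else is a reason not to trust the failover yet.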
But in the financial worl
Re: (Score:3)
why brag of having a backup system when it sounds like no one actually used it to test out the reboot theory
Backup systems should only be used to practice recovery procedures and for actual recoveries. Validating theories and planned changes should be done on a test/dev/qa system. Otherwise you risk the backup system not being available when you need it.
Sorry, off topic (Score:2)
Re:Sorry, off topic (Score:5, Funny)
Could someone please tell me why I can only see 4 comments for this story, even though my threshold is set to (-1)?
Corrupt file!
Re: (Score:2)
Revenge of Southwest (Score:2)
Long Lost Fond Memories Of Unix Admin (Score:3)
I hear the song singing, "Memories of the fun we had a long time ago!"
I took over a system admin role (person gifted us an extended middle finger and left very fast).
And the database corrupted. Yes, I know that a production relational database is supposed to have redo logs and archived redo logs.
Except that the backed up archived redo logs each had nothing but an EOF character.
I looked at the backup script. The cp $archive /backup/server/$archive was commented out.
I immediately wet my pants.
We called in Oracle and they did their magic and were able to restore the database.
Before I let the customer use it, I corrected the script and another script that had an error. I performed a backup. I then did a restore of that backup to another machine we had in spare. The database worked. Only then did I let the customer use their database.
Please. Please. Check *all* scripts when you take over a role.
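A cheap automated check would have caught the empty archives long before the restore was needed. The sketch below compares each backup file against its source by size and checksum; `verify_backup` and the file names are hypothetical, and this is no substitute for the full restore test described above.

```python
import hashlib
import os
import shutil
import tempfile

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(source, backup):
    """Return a list of reasons why `backup` is not a faithful copy of `source`."""
    if not os.path.isfile(backup):
        return ["backup file is missing"]
    problems = []
    if os.path.getsize(backup) == 0 and os.path.getsize(source) > 0:
        problems.append("backup is empty")
    if sha256(source) != sha256(backup):
        problems.append("checksum mismatch")
    return problems

# Demo: a backup script whose copy step was commented out leaves an empty file.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "redo_0001.arc")
backup = os.path.join(workdir, "redo_0001.arc.bak")
with open(source, "wb") as f:
    f.write(b"archived redo log contents")
open(backup, "wb").close()

before = verify_backup(source, backup)   # broken backup
shutil.copyfile(source, backup)          # what the script should have done
after = verify_backup(source, backup)    # healthy backup
```

Run from cron after every backup, a non-empty result is the alarm nobody got in the story above.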
Perhaps all this FAA drama is due to an accidentally commented out line in an itty bitty script file.
May the tears flow from my eyes and form a new Niagara Falls as I cry with these memories.
With Deep, Endearing, and Compassionate Love
Mark Allyn
Re: (Score:2)
Ow.
I've thankfully never had that happen in a production system. Mostly via pure luck.
But this is why we Test. Backups. Regularly.
And beyond that . . we test disaster recovery procedures in general, by, occasionally, and usually at night when not much else is happening, simulating the worst realistic failure scenario we can imagine, and making sure we can bring backup systems and data online within our agreed RTO (recovery time) and RPO (recovery point, meaning essentially how much data may be lost with
Rebooting, the lazy and negligent fix (Score:2)
corrupt file (Score:2)
We don't know where in the system this "file" was. Could have been a database system file, or just a database record, or an operating system file, or some actual file in the NOTAM application.
They make a point of suggesting the "corruption" could be due to bad user input via whatever interface lets them update this file. So it could have started with operator error.
Whatever it is, they had a hot backup where the corruption was replicated.
What seems to be missing is a change control process where the softwa
Re: (Score:2)
This is my guess. Likely not corruption in a technical sense, more likely invalid input - maybe a field left blank and it couldn't process nulls, or maybe a non-ASCII character it couldn't cope with, or, or...
That kind of thing. "Corrupt file" sounds more like the layman explanation than a technical reality.
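To illustrate the kind of guard that catches this before it lands in the database, here is a rough sketch. The field names and the ASCII-only rule are assumptions made up for the example; nothing public says what the NOTAM system actually validates.

```python
from __future__ import annotations

from typing import Optional


def validate_record(record: dict[str, Optional[str]],
                    required: tuple[str, ...]) -> list[str]:
    """Return a list of problems found in one input record.

    Hypothetical rules: every required field must be present and
    non-blank, and must contain only ASCII characters.
    """
    errors = []
    for field in required:
        value = record.get(field)
        if value is None or value.strip() == "":
            errors.append(f"{field}: missing or blank")
        elif not value.isascii():
            errors.append(f"{field}: contains non-ASCII characters")
    return errors
```

Rejecting the record at entry time means a bad value never reaches the primary, so it never gets replicated to the backup either.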
It's probably something simple like (Score:2)
A fat-finger mistake';DROP TABLE FLIGHTS
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Oh, you're thinking in much too modern terms. It's probably a dBase database where each table is actually a separate file on disk, with no support for transactions. Somebody probably opened the file in Notepad and typed something wrong!
Re: It's probably something simple like (Score:4, Funny)
...or ejected the floppy while the red light was on.
FAA and drones (Score:2)
So the FAA can spend endless time and money on frivolous stuff like remote id and drone regulations, but for serious stuff like actual aviation nothing is being done.
Many Lies: A DB check would have found it (Score:2)
So sloooow... (Score:2)
a significant decision, because the reboot can take about 90 minutes
90 minutes? They should have upgraded to systemd.
Binary backups (Score:2)
Classic speed/fragility trade-off.
Stream SQL duplication whenever you can afford it.
Shipping binaries over the wire means that when a solar storm flips a bit on the primary your secondary is fucked too.
Quickly of course.
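If you are stuck shipping binary blocks, a checksum taken at write time at least lets the secondary refuse a damaged block instead of applying it. A minimal sketch, with made-up function names and a plain list standing in for the replica; it only helps if the checksum was computed before the bit flipped.

```python
from __future__ import annotations

import zlib


def seal(block: bytes) -> tuple[bytes, int]:
    """Pair a block with its CRC-32, computed when the block is written."""
    return block, zlib.crc32(block)


def apply_if_intact(block: bytes, expected_crc: int,
                    replica: list[bytes]) -> bool:
    """Append the block to the replica only if its CRC-32 still matches.

    Returns False (and leaves the replica untouched) on a mismatch.
    """
    if zlib.crc32(block) != expected_crc:
        return False
    replica.append(block)
    return True
```

A rejected block is a loud failure on one node, which beats a quiet one on both.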
Typical big-government issue (Score:2)
Incompetence, corruption and greed
Government jobs should not be protected from incompetence. Responsible managers should be fired.
Contracts to fix this should be fixed price with disincentives for failure. The new Air Force One contract is a good template.
Let me guess... (Score:2)
It's an old Microsoft Access database that hasn't been compacted in years, isn't it?
Re: (Score:2)
Or how about dBase, where each table is a separate file, with no support for transactions. Maybe somebody opened the file in Notepad and typed something wrong!
Good database management? (Score:2)
What kind of database? (Score:2)
Was it a real database like SQL Server or Postgres? Or was it MS Access or even older, like Paradox or dBase? In some of those really old databases, you could edit the files in Notepad, but if you typed something wrong, you were toast.
primary and backup same fail (Score:3)
How is it that the primary and the backup had the same failure mode? Does not sound like they are independent.
Re: (Score:2)
They probably did gold-plate so thickly that they did not notice the rotting cardboard below that gold layer.
Re: (Score:2)
You can do both, Trumptard. You can use gender inclusive language and be competent technically.
Of course. They can and should use inclusive language.
But do they really need an "agency-wide initiative," rather than just sending a memo?
Re: (Score:2)
Re: (Score:2)
Wow, you're gullible. Or a shameless liar. The name change was purely Biden administration wokery. They rolled that in with legitimate changes (to better align with international conventions) that were responsive to the bill that Trump signed.
Re: (Score:2)
There is NO SUCH THING as a "Notice To Air Missions" - this is a woke-ism invented by the Biden boyz. The actual LEGAL definition of the term "NOTAM" is "Notice to Airmen" - what it has been for as long as it existed. Unfortunately for Transportation Secretary Buttigieg, who is trying to erase the word "men" rather than spend his time on actual productive work
[...]
all the news media ran with the newspeak definition as though they were all cloned sheep [...]
Orwell's 1984
No, the "legal term" is "Missions" because the FAA officially changed it to "Missions". But why would the public or the news media care what NOTAM used to stand for?
(I assume they amended it in the CFR; that's their Administrative purview.)
It might be stupid gratuitous PC woke-ism, because, were there really female pilots who were offended? Doesn't every other FAA publication still refer to "Airman", anyway? (Although I don't know why it just didn't say "Pilot" all over the place forever, anyway.)
But it's n
Re: (Score:2)
There are more countries in the world than the US, and they haven't changed what NOTAM means. See, for example, https://www.icao.int/safety/is... [icao.int] . The international definition is Notice To Airmen.