
A Corrupt File Led To the FAA Ground Stoppage (cnn.com) 176

According to CNN, the Federal Aviation Administration system outage on Wednesday has been traced to a corrupt file. From the report: In a statement late Wednesday, the FAA said it was continuing to investigate the outage and "take all needed steps to prevent this kind of disruption from happening again." "Our preliminary work has traced the outage to a damaged database file. At this time, there is no evidence of a cyberattack," the FAA said. The FAA is still trying to determine whether any one person or "routine entry" into the database is responsible for the corrupted file, a government official familiar with the investigation into the NOTAM system outage told CNN.

When air traffic control officials realized they had a computer issue late Tuesday, they came up with a plan, the source said, to reboot the system when it would least disrupt air travel, early on Wednesday morning. But ultimately that plan and the outage led to massive flight delays and an unprecedented order to stop all aircraft departures nationwide. The computer system that failed was the central database for all NOTAMs (Notice to Air Missions) nationwide. Those notices advise pilots of issues along their route and at their destination. It has a backup, which officials switched to when problems with the main system emerged, according to the source. FAA officials told reporters early Wednesday that the issues developed in the 3 p.m. ET hour on Tuesday.

Officials ultimately found a corrupt file in the main NOTAM system, the source told CNN. A corrupt file was also found in the backup system. In the overnight hours of Tuesday into Wednesday, FAA officials decided to shut down and reboot the main NOTAM system -- a significant decision, because the reboot can take about 90 minutes, according to the source. They decided to perform the reboot early Wednesday, before air traffic began flying on the East Coast, to minimize disruption to flights. "They thought they'd be ahead of the rush," the source said. During this early morning process, the FAA told reporters that the system was "beginning to come back online," but said it would take time to resolve. The system, according to the source, "did come back up, but it wasn't completely pushing out the pertinent information that it needed for safe flight, and it appeared that it was taking longer to do that." That's when the FAA issued a nationwide ground stop at around 7:30 a.m. ET, halting all domestic departures.
The source said the NOTAM system is an example of aging infrastructure due for an overhaul. "Because of budgetary concerns and flexibility of budget, this tech refresh has been pushed off," the source said. "I assume now they're going to actually find money to do it."
This discussion has been archived. No new comments can be posted.

  • by NoMoreACs ( 6161580 ) on Thursday January 12, 2023 @02:13AM (#63201960)

    3:30 AM EST sounds like the nadir of passenger air traffic over the U.S.

    That means it should be finished rebooting by around 5:00 AM EST. That gives an hour before morning passenger traffic begins to verify everything is working ok.

    Their timing makes no sense.

    • by gtall ( 79522 )

      And you know so much about their internal systems how?

      • And you know so much about their internal systems how?

        I'm just taking the FAA's estimate about how long it would take to Restart and Verify proper operation of the system (about 1.5 hours).

        The rest is just being able to tell time.

  • by The Rizz ( 1319 ) on Thursday January 12, 2023 @02:17AM (#63201968)

    The FAA is still trying to determine whether any one person or "routine entry" into the database is responsible for the corrupted file.

    I see that Little Bobby Tables is at it again.
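For anyone who missed the reference: the Bobby Tables joke is about user input being pasted straight into a SQL string. Here is a minimal sketch of the usual defense, bound parameters, using Python's built-in sqlite3 module. The `notices` table and the `add_notice` helper are made up for illustration; nothing is publicly known about the actual NOTAM schema.

```python
import sqlite3

def add_notice(conn, airport, text):
    # The ? placeholders bind the values as data, so quotes and
    # semicolons in the input are never interpreted as SQL.
    conn.execute(
        "INSERT INTO notices (airport, text) VALUES (?, ?)",
        (airport, text),
    )
    conn.commit()

# Hypothetical in-memory table standing in for a real notices database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notices (airport TEXT, text TEXT)")

hostile = "RWY 04L closed'); DROP TABLE notices;--"
add_notice(conn, "KJFK", hostile)

# The table survives and the hostile string is stored verbatim.
row_count = conn.execute("SELECT COUNT(*) FROM notices").fetchone()[0]
stored = conn.execute("SELECT text FROM notices").fetchone()[0]
```

This only covers the injection scenario. It would not help if the "corrupt file" was damaged on disk rather than through bad input.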

  • by pbry4n ( 7208566 ) on Thursday January 12, 2023 @02:19AM (#63201970)
    So, the FAA system is a central point of failure that can be taken down with one corrupt file, costing untold millions of dollars in expenses for airlines. Gentlemen, start your lobbies.
    • oracle DB? on windows server?

    • It's also a bit ironic given the shit handed out to SouthWest just a week or so previously, and which is still in the news. Not that it makes up for SW's screwup, but at least their mistake didn't bring the entire system crashing down - only their own. And wasn't the FAA even talking about investigating the incident, decrying how this was unacceptable? Pot, kettle, black.

      Obligatory XKCD (you know the one), except substitute the FAA's creaking software infrastructure for the Nebraskan's project as that sm

      • by lsllll ( 830002 )
        The SW outage affected over 5000 flights. This one only about 700. So it was actually smaller.
        • Not sure where you're getting that 700 number, but it seems to be way off. Maybe a preliminary figure? Recent numbers are here:

          Approximately 10,103 flights within, into or out of the United States have been delayed and 1,343 have been canceled as of 9:20 p.m. ET Wednesday, according to the flight-tracking website FlightAware.

          https://www.cnn.com/us/live-ne... [cnn.com]

      • It's also a bit ironic given the shit handed out to SouthWest just a week or so previously

        It's not a contest. We shouldn't tolerate incompetence from airlines just because the FAA is worse.

        If we make the government the "gold standard" for competence, we are doomed.

      • by gtall ( 79522 )

        What a pointless remark. Organizations have technical failures, it happens. What do you expect us to have, several competing organizations doing what the FAA does? Or maybe you'd like to adequately fund the FAA to have multiple redundant systems. The R's will take away your magic Ayn Rand decoder ring for that one.

        • by Entrope ( 68843 )

          What does he expect? Probably that if an organization is going to criticize one airline for having an outdated IT system that cancels and delays a boatload of flights, it could ... not make all airlines rely on its own outdated IT system that prevents flights from taking off. The FAA doesn't need to have multiple systems to do this, they just need to have one reasonably redundant and resilient system. They're an aviation agency, not a dot-com, so they don't need to tolerate the kind of organizational fai

          • > "move fast and break things" attitude.

            I'm pretty sure this kind of attitude wasn't involved in any way...

        • Damn! Where are my mod points when I need them???
      • I wonder if Southwest gets to fine the FAA.

        Really, the entire industry should sue the FAA and DoT for gross negligence in their failure to make sure the systems they require airlines to depend upon actually work. They lost a ton of business and I suspect will be forced to compensate customers for the government's screw-up.

    • by bookwormT3 ( 8067412 ) on Thursday January 12, 2023 @02:56AM (#63202012)

      since this seems to be the only serious thread:

      Interesting that the corrupt file was in the backup system too. I'm always dubious about "live mirror" data architecture being the only backup, as it only protects against hardware failure, not bad operations/deletions/whatever. Better to do RAID6 or better on the primary, and have a "warm", not "hot" alternate. Bonus if you can select how old your data is (1 day, 1 hour...)

      Also interesting that the reboot takes 90 minutes. I've seen stuff like this before, where between slow computer, maximum auditing enabled, required-to-run applications, and bloated apps, a reboot takes along this timescale (might have been under an hour), but for a single system of critical importance... Yikes. Sounds like the lowest bidder needs to be told to put a faster computer under it, and maybe parallelize some things. (even if it's some old AS/400 or something, there's upgrade paths for anything, even if you need an emulator)

      • by bradley13 ( 1118935 ) on Thursday January 12, 2023 @03:32AM (#63202056) Homepage

        You make good points. I also just want to toss out that a system as critical as this ought to have two or even three levels of failover. The first is the "live mirror", to handle hardware problems, power failures, etc. The second level should be an offline system that automatically takes over with older, "known good" data - i.e., a snapshot taken from a time when the system was known to be running correctly (likely 24 hours old). You could easily argue for a third level that uses no live data at all, but simply provides absolute minimal functionality.

        I also like the approach taken by a certain critical organisation I am acquainted with: Upper management has access to a switch that disables the primary systems. They flip that switch without notice, at least once a year, just to see that the failover systems actually do come online seamlessly. Obviously, that doesn't catch everything (might not have caught this data corruption problem), but it does ensure that the failover systems are taken seriously.

        The 90 minute reboot only becomes a problem, if the failover doesn't work. As it didn't in this case.
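A rough sketch of the tiered failover idea described in this comment: try each level in order and fall through to the next one when a level fails. The tier names and loader functions are invented for the example, and a real system would need health checks far beyond "did the loader raise".

```python
def serve_from_first_healthy(tiers):
    """Return (tier_name, data, failures) from the first tier whose loader succeeds.

    tiers is a list of (name, loader) pairs in order of preference.
    A loader returns the data or raises an exception.
    """
    failures = []
    for name, loader in tiers:
        try:
            data = loader()
        except Exception as exc:
            # Record why this tier was skipped, then try the next one.
            failures.append((name, str(exc)))
            continue
        return name, data, failures
    raise RuntimeError(f"all tiers failed: {failures}")

# Hypothetical loaders: the mirror shares the primary's corruption,
# the snapshot holds older known-good data, the last tier serves nothing live.
def live_mirror():
    raise ValueError("corrupt file")

def snapshot_24h():
    return ["notice from the known-good snapshot"]

def minimal_mode():
    return []

tier, data, failures = serve_from_first_healthy([
    ("live mirror", live_mirror),
    ("24h snapshot", snapshot_24h),
    ("minimal", minimal_mode),
])
```

In this run the mirror fails the same way the primary would, so the 24-hour snapshot ends up serving, which is the case a mirror-only setup cannot handle.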

      • On Linux, you can use rsnapshot to maintain an arbitrarily long history of iterative backups that doesn't take up all that much space. Unchanged files are hard-linked, which takes up little space other than the initial version save; changed files get uniquely saved, so you can review/access a file's versions over the history of your saved backups.
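To illustrate the hard-link trick described above, here is a small Python sketch of the same idea. It is not rsnapshot's actual implementation (rsnapshot is built on rsync); the `snapshot` function and the directory names are made up, and it assumes a filesystem that supports hard links.

```python
import filecmp
import os
import shutil
import tempfile

def snapshot(source, previous, target):
    """Copy source into target, hard-linking files unchanged since previous."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(target, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.normpath(os.path.join(target, rel, name))
            old = os.path.normpath(os.path.join(previous, rel, name)) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)       # unchanged: share the existing copy
            else:
                shutil.copy2(src, dst)  # new or changed: store a fresh copy

# Demo with throwaway directories.
base = tempfile.mkdtemp()
live = os.path.join(base, "live")
snap1 = os.path.join(base, "snap.1")
snap2 = os.path.join(base, "snap.2")
os.makedirs(live)
with open(os.path.join(live, "static.txt"), "w") as f:
    f.write("unchanged")
with open(os.path.join(live, "notices.txt"), "w") as f:
    f.write("v1")

snapshot(live, None, snap1)
with open(os.path.join(live, "notices.txt"), "w") as f:
    f.write("version two")
snapshot(live, snap1, snap2)
```

After the second run, `static.txt` is one file on disk shared by both snapshots, while `notices.txt` exists in both versions.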

        • On Linux, you can use rsnapshot to maintain an arbitrarily long history of iterative backups that doesn't take up all that much space. Unchanged files are hard-linked, which takes up little space other than the initial version save; changed files get uniquely saved, so you can review/access a file's versions over the history of your saved backups.

          Today on Linux you'd just use zfs and do actual snapshots, you can just cd into them and look around.

          • by jabuzz ( 182671 )

            Explain to me how you are going to recover from a file system snapshot if the file system itself is corrupted? You can't so you need to do other sorts of backup. rsnapshot is frankly a bit crap in this regard, though it is better than a poke in the eye. There are much better commercial solutions that I would expect to be used in this scenario.

      • Why does it sound like this critical system is served from a single machine in a closet somewhere? Multi-homed load balancing is a thing! A reboot of a single machine should never interrupt the service.
        • by ghoul ( 157158 )
          This wasn't a hardware issue, it was a data issue. Multi homing would not solve that
          • This wasn't a hardware issue, it was a data issue.

            The root problem was data. But that was exacerbated by slow hardware. There is no excuse for a mission-critical system to take 90 minutes to reboot.

      • by v1 ( 525388 ) on Thursday January 12, 2023 @09:05AM (#63202536) Homepage Journal

        The "live backup" with the corrupt file was not there to serve as a backup for corrupt data, it was there to serve as a failover for failed HARDWARE. And to eliminate the need to wait 90 minutes for a reboot after swapping a card or something.

        While they haven't gotten specific about the file, it was probably a database table with a nonsense entry in it that broke a query or two. And that table was probably containing "live data", so you can't just restore yesterday's backup. The table had to be rebuilt from live data gathered from other sources. This takes time, and the system can't be running while it's being rebuilt.

        As for startup time, it's very likely running in a VM, and those can take time to spin up, especially when there are other things like SQL servers and data sources that also have to be brought back up. If you have to pull the plug on a VM server, it can easily take HOURS for large VMs and SQL tables to run their error checks and self-repairs. It "only" took 90 minutes in this case because it was probably running on a dedicated system that only had to bring up a few things from a hard shutdown. They probably had the bad table rebuilt before the VM was fully back up and waiting for it.

        Of course a lot of this can only be speculated about without more information, but you get the idea. This wasn't merely a "my router's not working, unplug it and plug it back in so I can get back to work" kind of fix.

        We had a VM server go down hard at work last year. It was running quite a few things, including our main SQL server. Took 45 minutes to get the VM itself fully back up, and several hours to restore the main systems. There were minor systems that weren't back up for several days due to the time required to check and repair everything. We thought we had everything in place to ensure a clean, orderly shutdown of our SQL server and VM, but that's not the easiest thing to test fully, and we got to learn a few new lessons that day. (we lost commercial power, it switched instantly to battery UPS, genny kicked on and we thought everything was fine, but we weren't alerted that the transfer switch failed, and by the time the low battery alarm on the UPS went off we didn't have enough time for a clean shutdown)

        • "The "live backup" with the corrupt file was not there to serve as a backup for corrupt data, it was there to serve as a failover for failed HARDWARE."

          Obviously. So the question is: where was their backup for corrupt data?

        • "And that table was probably containing "live data", so you can't just restore yesterday's backup. "

          Yesterday's data isn't too relevant for today's flight conditions. The only reason to have a backup is for historical reasons.

        • by jabuzz ( 182671 )

          Really? We had an unexpected shutdown of an ESXi server at work last year and all the VMs were restarted on other ESXi servers in the cluster and nobody noticed a damn thing. I was panicking like mad for a couple of minutes until I realized the incident had gone unnoticed. For a silver lining, we got a real world live test that our HA actually worked as designed.

          Basically, if you can't just yank the power on your VM server and everything just keep trucking in very short order then your system is badly designe

      • by tlhIngan ( 30335 )

        Interesting that the corrupt file was in the backup system too. I'm always dubious about "live mirror" data architecture being the only backup, as it only protects against hardware failure, not bad operations/deletions/whatever. Better to do RAID6 or better on the primary, and have a "warm", not "hot" alternate. Bonus if you can select how old your data is (1 day, 1 hour...)

        Also interesting that the reboot takes 90 minutes. I've seen stuff like this before, where between slow computer, maximum auditing enab

  • by Joe_Dragon ( 2206452 ) on Thursday January 12, 2023 @02:25AM (#63201978)

    why did I call my airport drop_all_tables?

  • There is never enough funds until the shit hit the fan and then all of a sudden there is a budget again. You just never know how many quarters you can find when you dig deep in between the couch cushions.

    • by ebob9 ( 726509 ) on Thursday January 12, 2023 @03:18AM (#63202036)

      There is never enough funds until the shit hit the fan and then all of a sudden there is a budget again. You just never know how many quarters you can find when you dig deep in between the couch cushions.

      “Never let a good crisis go to waste”

    • by ShanghaiBill ( 739463 ) on Thursday January 12, 2023 @05:46AM (#63202254)

      There is never enough funds until the shit hit the fan

      Not true. Congress gave the FAA billions for the NextGen Air Transport System [wikipedia.org], and the FAA squandered the money on a massive project that was started in 2007 and isn't expected to be ready until 2030, after 23 years of development. Few expect it to be completed even by that late date, if ever.

      The ObamaCare rollout showed that with government IT projects, success is correlated with LOWER funding. Oregon spent more than any other state on ObamaCare, over $300 million, hired Oracle for the implementation, and their rollout was a disaster. Kentucky spent $3M (1% as much), and their site worked flawlessly on day one.

      The secret to Kentucky's success was simple:

      1. Use their own employees so they have skin in the game rather than contractors that profit from chaos.
      2. Give the project to a small team with an established track record of completing projects on time and on budget.
      3. Starve them of resources so they have no choice but to stick to a clean and simple design.
      4. Never, ever, ever, ever even THINK about using Oracle as a contractor.

      The problem with the FAA isn't "not enough funding," but the opposite, too much funding, leading to an overly ambitious project that has little chance of success.

      • Re:Budgetary Issues (Score:4, Informative)

        by Nkwe ( 604125 ) on Thursday January 12, 2023 @12:28PM (#63203084)
        Oregon resident here. What's even more tragic about this is that there was a local software company here in Portland that basically had the system needed to solve this problem, but wasn't tapped or considered as a contractor. This system (with a proven track record) could have been adapted and implemented for a cost similar to what Kentucky spent.
  • Define "corrupt" (Score:5, Insightful)

    by xevioso ( 598654 ) on Thursday January 12, 2023 @03:20AM (#63202040)

    Entering something incorrectly into a database using some standard system the FAA would use to do so isn't "corrupt". Downloading or transferring data or a file, and having that transfer interrupted, resulting in the loss of or alteration of data, is "corrupt" in the technical sense, where I come from.

    Sometimes we are far far far removed from the description given from the IT folks involved to their superiors; I'm sure the issue would be pretty clear to us in technical terms, but a public-facing answer such as "a file got corrupted" is fucking nonsense.

    • Entering something incorrectly into a database using some standard system the FAA would use to do so isn't "corrupt". Downloading or transferring data or a file, and having that transfer interrupted, resulting in the loss of or alteration of data, is "corrupt" in the technical sense, where I come from.

      Sometimes we are far far far removed from the description given from the IT folks involved to their superiors; I'm sure the issue would be pretty clear to us in technical terms, but a public-facing answer such as "a file got corrupted" is fucking nonsense.

      Agree.
      This sounds more like a copy/paste from a Word or Excel doc into a database.

  • To make good on that promise of preventing this from happening again, they probably would need to throw everything away and redesign and re-implement using actually competent people and high-availability paradigms. Who thinks that will happen? Yeah, me not either.

    • throw everything away and redesign and re-implement

      No. No. No. This is exactly the wrong thing to do, and it is WHAT THEY DID.

      The "redesign" was started in 2007 and is scheduled (haha) to be completed by 2030: Next Generation Air Transportation System [wikipedia.org].

      The system that failed is the old system that no one works on anymore, so they can focus on the "redesign".

      Things you should never do: Rewrite software from scratch [joelonsoftware.com]

      • by haruchai ( 17472 )

        "Things you should never do: Rewrite software from scratch [joelonsoftware.com]"

        Elon Musk disagrees, often

      • by gweihir ( 88907 )

        It really depends. But if they think they can design a system in 23 years, they are bound for failure. That time is far too long.

        The problem with these projects and why they all fail is that nobody working on it actually wants to make it work. They all just want to get rich. If you redesign and re-implement with the right people and with duct-tape over "management's" mouths so they cannot interfere, this can work and usually does. Problem is it is almost never done this way.

      • by StormReaver ( 59959 ) on Thursday January 12, 2023 @11:23AM (#63202890)

        Things you should never do: Rewrite software from scratch

        Never trust any advice that deals in absolutes. Where I work, we had an old mission-critical system written in a language that only one active programmer knew, and he was nearing retirement, and that ran on a system that had been obsolete for years.

        I took on the task of rewriting it from scratch using modern technology. It took a few years to get it to production, but it was well worth it. He retired a couple years before then, so the one person we had left who understood how to use the old system supported it until I was done with the rewrite enough for it to be usable.

        I replicated (and improved) all the features of the old system, and added a whole slew of new features that the users had only dreamed of for the decades the old system was in place.

        Sometimes rewriting old systems from scratch is the only option.

  • "did come back up, but it wasn't completely pushing out the pertinent information that it needed for safe flight, and it appeared that it was taking longer to do that."

    TFS was a decent nutshell of what happened, but one glaring question remains; why brag of having a backup system when it sounds like no one actually used it to test out the reboot theory, and perhaps prepare for this unforeseen delay?

    Tends to explain why no one is giving them a budget. Do they know what to do with it?

    • by lsllll ( 830002 )
      Perhaps the last few backups were corrupt and the last good backup was from a couple of days ago, but they didn't want to go that far back. Either way, they should have had the transactions chronicled so that they could have gone to a good backup and played forward, but that may have taken too long.
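The "go to a good backup and play forward" idea is roughly what databases call point-in-time recovery. Here is a toy Python sketch, assuming the state is a simple key/value map and the log is a list of (sequence, key, value) entries. The data and the `is_valid` check are invented for illustration.

```python
def restore(backup, log, is_valid):
    """Rebuild state from a known-good backup by replaying logged changes in order.

    Entries that fail validation are skipped and reported instead of applied.
    """
    state = dict(backup)
    skipped = []
    for seq, key, value in sorted(log, key=lambda entry: entry[0]):
        if not is_valid(key, value):
            skipped.append(seq)
            continue
        state[key] = value
    return state, skipped

# Hypothetical data: entry 2 is the bad record that broke the live system.
backup = {"KJFK": "RWY 04L closed"}
log = [
    (2, "KBOS", None),
    (1, "KORD", "TWY B closed"),
    (3, "KJFK", "RWY 04L open"),
]

def is_valid(key, value):
    return isinstance(value, str) and value.strip() != ""

state, skipped = restore(backup, log, is_valid)
```

The catch raised in the comment still applies: replaying a long log takes time, and you have to be able to recognize the bad entry in the first place.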
    • In all fairness, I'm not sure how one would test this sort of backup system, considering that commercial aviation under normal circumstances is a 24/7/365.25 business.

      I guess maybe bring the backup system up, but don't connect it to whatever mechanisms actually relay the information to the pilots or whomever else needs it? Then make sure the outputs of the live and backup systems match 100%?

      Of course that kind of testability would need to be architected in, possibly after the fact.

      But in the financial worl

    • why brag of having a backup system when it sounds like no one actually used it to test out the reboot theory

      Backup systems should only be used to practice recovery procedures and for actual recoveries. Validating theories and planned changes should be done on a test/dev/qa system. Otherwise you risk the backup system not being available when you need it.

  • Could someone please tell me why I can only see 4 comments for this story, even though my threshold is set to (-1)?
  • Maybe next time FAA wants to give shit to Southwest , they should remember Southwest has Bobby Tables as a management pilot and Southwest may just schedule him on a flight and request NOTAMS to his id.
  • by mallyn ( 136041 ) on Thursday January 12, 2023 @04:43AM (#63202138) Homepage
    Folks

    I hear the song singing, "Memories of the fun we had a long time ago!"

    I took over a system admin role (person gifted us an extended middle finger and left very fast).

    And the database corrupted. Yes, I know that a production relational database is supposed to have redo logs and archived redo logs.

    Except that the backed up archived redo logs each had nothing but an EOF character.

    I looked at the backup script. The cp $archive /backup/server/$archive was commented out.

    I immediately wet my pants.

    We called in Oracle and they did their magic and were able to restore the database.

    Before I let the customer use it, I corrected the script and another script that had an error. I performed a backup. I then did a restore of that backup to another machine we had in spare. The database worked. Only then did I let the customer use their database.

    Please. Please. Check *all* scripts when you take over a role.

    Perhaps all this FAA drama is due to an accidentally commented out line in an itty bitty script file.

    May the tears flow from my eyes and form a new Niagara Falls as I cry with these memories.

    With Deep, Endearing, and Compassionate Love

    Mark Allyn
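A cheap check in the spirit of this story: after the backup job runs, confirm that every copy exists, is not empty, and matches the original. Below is a minimal Python sketch with made-up file names. It is no substitute for the full restore test described above, which is the only thing that proves a backup is usable.

```python
import hashlib
import os
import shutil
import tempfile

def sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(pairs):
    """Return a list of (copy_path, reason) for every copy that looks wrong."""
    problems = []
    for original, copy in pairs:
        if not os.path.isfile(copy):
            problems.append((copy, "missing"))
        elif os.path.getsize(copy) == 0:
            problems.append((copy, "empty"))
        elif sha256(original) != sha256(copy):
            problems.append((copy, "checksum mismatch"))
    return problems

# Demo: one good copy, one empty copy, one copy that was never made.
base = tempfile.mkdtemp()
names = ["redo_1.log", "redo_2.log", "redo_3.log"]
pairs = []
for name in names:
    original = os.path.join(base, name)
    with open(original, "wb") as f:
        f.write(b"redo data for " + name.encode())
    pairs.append((original, os.path.join(base, "backup_" + name)))

shutil.copy2(pairs[0][0], pairs[0][1])   # good copy
open(pairs[1][1], "wb").close()          # empty file, as in the story
problems = verify_backup(pairs)          # third copy is missing
```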

    • Ow.

      I've thankfully never had that happen in a production system. Mostly via pure luck.

      But this is why we Test. Backups. Regularly.

      And beyond that . . we test disaster recovery procedures in general, by, occasionally, and usually at night when not much else is happening, simulating the worst realistic failure scenario we can imagine, and making sure we can bring backup systems and data online within our agreed RTO (recovery time) and RPO (recovery point, meaning essentially how much data may be lost with

  • Really, any application or database that can be "fixed" with a reboot is really bad. At most an application or DBMS process shutdown and restart (ideally at most, gracefully) should be all it takes, but you should know if that will fix it or not before you do it. Either something is wrong with the OS, or something is wrong with the application. I'm not surprised it wasn't fixed by that course of action. While I'm on a roll... I'm just guessing that IBM is the outsourcer for those systems. Pay the cheapest j
  • We don't know where in the system this "file" was. Could have been a database system file, or just a database record, or an operating system file, or some actual file in the NOTAM application.

    They make a point of suggesting the "corruption" could be due to bad user input via whatever interface lets them update this file. So it could have started with operator error.

    Whatever it is, they had a hot backup where the corruption was replicated.

    What seems to be missing is a change control process where the softwa

    • by mccalli ( 323026 )
      > "Or maybe it was syntactically valid input, and only "corrupt" in the sense that during processing, buggy data could sneak in to cause cascading errors."

      This is my guess. Likely not corruption in a technical sense, more likely invalid input - maybe a field left blank and it couldn't process nulls, or maybe a non-ASCII character it couldn't cope with, or, or...

      That kind of thing. "Corrupt file" sounds more like the layman explanation than a technical reality.
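If the guess about blank fields or unexpected characters is right, the usual fix is to validate records at the point of entry, before they reach the database. Here is a minimal Python sketch. The field names and the ASCII-only rule are assumptions for illustration, not anything known about the real NOTAM format.

```python
REQUIRED_FIELDS = ("location", "text")  # hypothetical field names

def validate_record(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            errors.append(f"{field}: missing or blank")
        elif not str(value).isascii():
            errors.append(f"{field}: non-ASCII character")
    return errors

good = validate_record({"location": "KJFK", "text": "RWY 04L closed"})
blank = validate_record({"location": "KJFK", "text": "   "})
odd = validate_record({"location": "KJFK", "text": "RWY 04L closed \u2013 see chart"})
```

Rejecting a record with a clear error message at entry time is a lot cheaper than finding out at 3 p.m. that downstream processing chokes on it.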
  • A fat-finger mistake';DROP TABLE FLIGHTS

  • So the FAA can spend endless time and money on frivolous stuff like remote id and drone regulations, but for serious stuff like actual aviation nothing is being done.

  • IN minutes. There will be logs so they can backtrace the offending transaction. Let's just read what the head DBA says.
  • a significant decision, because the reboot can take about 90 minutes

    90 minutes? They should have upgraded to systemd.

  • Classic speed/fragility trade-off.

    Stream SQL duplication whenever you can afford it.

    Shipping binaries over the wire means that when a solar storm flips a bit on the primary your secondary is fucked too.

    Quickly of course.

  • Incompetence, corruption and greed

    Government jobs should not be protected from incompetence. Responsible managers should be fired.

    Contracts to fix this should be fixed price with de-incentives for failure. The new Air Force One contract is a good template.

  • It's an old Microsoft Access database that hasn't been compacted in years, isn't it?

    • Or how about dBase, where each table is a separate file, with no support for transactions. Maybe somebody opened the file in Notepad and typed something wrong!

  • Correct me if I'm wrong, but isn't the correct way to manage critical databases to have mirrored databases on hot standby so you can cut over to a mirror should the currently active database become corrupted? And of course RAID 1 mirroring of the drives as well? Of course, doing it right costs more money for the redundancy, but in the long run, isn't that cheaper than GROUNDING ALL FLIGHTS FOR A DAY???
  • Was it a real database like SQL Server or Postgres? Or was it MS Access or even older, like Paradox or dBase? In some of those really old databases, you could edit the files in Notepad, but if you typed something wrong, you were toast.

  • by groobly ( 6155920 ) on Thursday January 12, 2023 @02:25PM (#63203402)

    How is it that the primary and the backup had the same failure mode? Does not sound like they are independent.
