FAA Says Contractor Unintentionally Caused Outage That Disrupted Flights
The Federal Aviation Administration has said that a contractor working for the air-safety regulator had unintentionally deleted computer files used in a pilot-alert system, leading to an outage that disrupted U.S. air traffic last week. From a report: The agency, which declined to identify the contractor, said its personnel were working to correctly synchronize two databases -- a main one and a backup -- used for the alert system when the files were unintentionally deleted. The FAA said it had taken steps to prevent a recurrence of the outage in the system used for collecting and distributing the alerts, known as Notice to Air Missions, or Notams. "The agency has so far found no evidence of a cyberattack or malicious intent," the FAA said late Thursday in a statement outlining preliminary findings in its continuing investigation. The FAA said that it had made necessary repairs to the system and has taken steps to make it more resilient.
Contractor not to the same standards as FAA (Score:5, Funny)
If IT Contractors were held to the same standard as the FAA, this would not have happened.
Obviously Slashdot editors are not held to any standard, since the post links to itself instead of the source.
Re: (Score:2)
If IT Contractors were held to the same standard as the FAA, this would not have happened.
Obviously Slashdot editors are not held to any standard, since the post links to itself instead of the source.
But that's the point of contracting... To get paid a fortune to not take any responsibility.
The problem is really government funding. To get a FTE (Full Time Equivalent) employee they have to justify the position to all and sundry (doubly so as it's a public sector job and everyone is looking at public sector "waste") but if you hire a contract that comes out of an operating budget, no need to justify the position, get approvals, only the basic checks required and all of that can be outsourced to an agen
Not quite that simple (Score:1)
The FAA is government, it makes the rules. It requires flights not happen if NOTAMs aren't available. It fucks up and NOTAMs are not available. Then blames a contractor. No, they hired the contractor, their fuckup.
They ought to be liable for the hundreds of cancelled flights to have a proper appreciation of the impact of what they do and so not skimp on those "unintentional outside contractors". This stinks far too much of "we've divested ourselves of all responsibility by throwing money at a lowest bidder
Re: (Score:2, Interesting)
Because this system is like 30 years old and developed based on the best-practices way of doing things at the time it was designed and developed. The FAA, NTSB, Airlines, Pilots Unions, and contractor have all been on record, prior to the incident, that the NOTAM system is woefully out of date, has been for some time, and needs updating.
This issue is no different than any other large organization saying about some antiquated system that is outdated, but functional - the cost of replacing it is more than (g
Re: (Score:2)
30 years ago it was best practices to not have only a single backup for mission-critical systems.
Re: (Score:2)
I am glad you think like that. Now let's see you lobby your representative and senators in Congress to adequately fund the FAA.
The mismanagement was brought to you by Congress, do you see them taking any blame?
Re: Not quite that simple (Score:2)
adequately fund the FAA
That won't help. Adequately funding the FAA (or any gov't department) is just a nice way of saying "more money for the contractors". If you think some poor schmuck at the FAA is going to sit down and develop a replacement for NOTAMs, you have no idea how fast industry lobbyists will mobilize to defund them again. Been there, done that.
Re: (Score:2)
The usual problem in these situations is that the government agency has neither the knowhow to do it themselves, NOR the knowhow to properly supervise a contractor to do it. That is why big projects like the FBI's everything system failed, why the Calif finance system failed, and the list goes on and on.
Re: (Score:2)
The government agency is the domain expert (by definition). And what's to stop them from hiring people with the software skills? Other than the lobbyists pushing for defunding, of course.
Re: (Score:2)
The other (obvious) option is to sanitize their inputs so they don't get corrupted files uploaded, and make the system fault-tolerant if they do get a corrupted file from a problem in the storage layer, along with raised visibility of what the actual issue is so they don't take multiple hours to diagnose it and fix it.
This seriously shouldn't have taken longer than like 30 minutes at most to diagnose and fix, and it's shameful that it took as long as it did.
This is what comes with using software well beyond
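To make the "validate the input, fall back if it's bad, and say why" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the real NOTAM file format is not described in the story, so the JSON payload, the SHA-256 checksum, and the names `sha256_hex` and `load_notices` are all hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def load_notices(raw: bytes, expected_digest: str, last_known_good: list):
    """Parse an uploaded file, keeping the last good data on any problem.

    Returns (notices, status) so an operator can see what went wrong
    instead of spending hours diagnosing it. The JSON format is assumed.
    """
    if sha256_hex(raw) != expected_digest:
        return last_known_good, "rejected: checksum mismatch"
    try:
        notices = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return last_known_good, f"rejected: unparseable ({exc.__class__.__name__})"
    if not isinstance(notices, list):
        return last_known_good, "rejected: expected a list of notices"
    return notices, "ok"
```

The point of returning a status string alongside the data is visibility: a corrupted upload leaves the service running on the previous data and produces a specific reason that can be logged or alerted on.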
Re: Shit happens (Score:2)
Re: (Score:1)
It was his first day [youtube.com].
Re: (Score:2)
Interesting, though, that a single action could take the entire system down. They might want to rethink their architecture a bit.
Re: (Score:2)
That would require having funding to develop a whole new system. And the hardest thing to do in government, is get funding for anything. Especially when you have a bunch of muppets in Congress that are obsessed with spending cuts to the point of ransoming the entire economy to get their way.
Link to the actual story (Score:5, Informative)
https://www.reuters.com/world/... [reuters.com]
The link in the summary just points back to this slashdot page.
Re: (Score:3)
Hey! Don't knock it - that's one way to generate click traffic.
They probably got a whole 30 extra views from the people around here that actually try to read the articles!
Re: (Score:2)
Mod parent funny! :-)
A working link (Score:3)
https://www.marketwatch.com/st... [marketwatch.com]
and a Wall Street Journal link (paywall)
https://www.wsj.com/articles/c... [wsj.com]
Government IT (Score:2, Interesting)
The way the federal government does software I'm surprised this shit doesn't happen more frequently.
In the experience of myself and my friends and colleagues who have had to interact with federal anything, something like 60 or 70% of the mental effort is "compliance" and the remainder is actually coding and business logic.
A buddy of mine worked for a network equipment manufacturer. A while back they tried to sell to some federal agency. They had to hire a former military dude who knew nothing about tech but
Re: (Score:1)
Why is OLD software automatically bad? Serious question. Why cannot software simply be DONE if it does what it's supposed to do?
Migration of software mostly unmodified to newer hardware is possible too if need be.
In the mainframe world, 30 year old software is NEW software.
Re: (Score:2)
I've always wondered this myself. Why is software never complete?
Re: (Score:3)
Murphy's Law: If software is being used, it will have to be changed. If software is no longer being used, it will have to be documented.
Seriously, your house is also never "complete." You might have to fix plumbing that worked fine before, or you might want an electrical outlet added, or a gas stove where there is a connection only for electric. The appearance becomes dated, you no longer want that paneling from the 70's. These types of factors affect software too. Anything that is worth using, will need co
Re: (Score:2)
Software does not breakdown in the same way as mechanical things like plumbing, software does not corrode or degrade over time due to the elements or wear and tear.
Wanting new features is a thing, sure. But does that really apply in critical safety systems in the same way it does in SaaS? No, of course not.
Re: (Score:3)
No, not in exactly the same way as plumbing, but there are very real similarities.
I just ran across one today. In Access, when you try to import a CSV file that is encoded as UTF-16, it fails. It doesn't properly interpret the BOM character at the beginning of the text file, and assumes that CSV is always ANSI. At one time in the past, this might have been a valid assumption. But times have changed, even Notepad understands the BOM character and the encoding options. A workaround is to open the CSV file in
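For what it's worth, honoring the BOM is only a few lines. Below is a rough Python sketch of the idea, not how Access or Notepad actually do it. The function names `sniff_encoding` and `read_csv_bytes` and the `cp1252` fallback (standing in for "ANSI") are my own assumptions.

```python
import codecs
import csv
import io


def sniff_encoding(raw: bytes, default: str = "cp1252") -> str:
    """Pick a codec from the byte-order mark, else fall back to `default`.

    UTF-32 is deliberately ignored in this sketch.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the UTF-8 BOM
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec consumes the BOM and picks the byte order
    return default


def read_csv_bytes(raw: bytes) -> list:
    """Decode CSV bytes using the sniffed encoding and return the rows."""
    text = raw.decode(sniff_encoding(raw))
    # newline="" lets the csv module handle \r\n line endings itself
    return list(csv.reader(io.StringIO(text, newline="")))
```

A file without a BOM still takes the old "assume ANSI" path, so the legacy behavior is kept while the newer encodings stop failing.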
Re: (Score:2)
Security alone makes sure software is never really "done". Software has reached a high level of complexity with very little time to perfect it. Care to point to any other industry with a comparable complexity level that reached its level of "done" in less or equal time, while at the same time constantly reinventing itself?
Re: (Score:2)
A lot of software doesn't survive interaction with users. As it turns out, users do very stupid things, and if the software engineers don't anticipate the length, width, and breadth of possible stupidity, you will ultimately discover edge cases of stupid that are unanticipated. That's when "bad things" start to occur.
Patches and updates usually incorporate fixes for the stupidity that could not be adequately predicted or anticipated.
Re: (Score:3)
Code rot.
The software may work, but environmental factors change, that were not anticipated in the original version. Or requirements change.
The best way to visualize this is to think about a very old house. 100 years ago, houses were not always built with an indoor bathroom. They certainly didn't have central air conditioning. The original construction of the house might have been fine and up to code, but times changed, code changed. You won't be able to sell that old house without retrofitting it to bring
Re: (Score:1)
This is bullshit rationalisation. Code does not rot. So don't say it does, for that's misrepresenting the situation. Software does not fail. [niquette.com] If it's sensitive to things it should not be sensitive to, then it was broken from the start, despite seeming to work. If it stops working because you changed the things upon which it depends that it should depend upon, you deliberately did gone and done broke the software. Don't do that. Or replace the software when you're pulling its rug.
Oh, and sure, plenty houses
Re: Government IT (Score:1)
Code doesn't rot but it can have dependencies. In fact, any computer program that interacts with the outside world beyond stdin and stdout has dependencies. And those dependencies change, even if the requirements and inputs stay constant.
Here's a nice example: before Linux 2.6.24, allocating a block of memory in kernel mode on x86_64 would implicitly allocate cached memory. After 2.6.24, the memory was explicitly uncached (and dead slow) and a new call (ioremap_cached) was introduced to give you cached memor
Re: (Score:2)
Well, it's mostly to do with people's expectations.
Software from a different era just has different constraints on it - they were much more worried about things like super-expensive RAM, CPU load, and expensive storage than we are today, because RAM, CPU, and storage are almost limitless in current environments. You have more computing power in your pocket than the mainframes this software was written for had in the era it was written, so there's far less quality checking and logging in old software, which
Re: (Score:2)
You aren't watching all the computers/servers/employees/contractors that the USA Federal Government has. I've seen mishaps like this during my IT days happening with Fed servers on a regular basis. It happens on servers that monitor Fed insurance, firearms, energy, etc. quite consistently. Just not on the level of effect that would make the headline news.
Re: (Score:1)
You'll be wanting to thank St. Reagan for the state of government software. He was the one who reasoned that those nice Beltway Bandits should have the agencies forced to spend money on them. In that environment, the only control the agencies have are regulations.
Re: (Score:2)
There are a handful of quite large consulting companies who specialize in this kind of business. The engineers are working on the project that was won last year after years of effort, and they will switch to the new one in 5 years once it is ready to start. A small outfit does not have that kind of pipeline, nor the lobbyists to ensure they get the contract.
No evidence of a cyber-attack or malicious intent (Score:2)
Nope, just ordinary incompetence.
This is why all deployments need to be automated, and tested by deploying via automation to a test server. Deployments are a high risk operation, and need to be tested as thoroughly as the software changes themselves.
One of the first lessons I learned... (Score:2)
Delete permission should only be given to admins. It's too easy to rm -rf * and folks that don't know better can inadvertently delete critical files.
Re: (Score:2)
Well, the synchronization process may need the permission to delete / overwrite any file that the DB uses.
And it seems like the backup sync process is to copy over the DB files and not use the in-DB backup / sync / DR process.
Now their DB may be too old to have an in-DB synchronization / DR system.
Re: (Score:2)
It's probably an ancient database like dBase, where each table was an individual file, and there was no "database backup" mechanism. Modern databases are maintained in a single file for the entire database. That doesn't seem to square with the explanation that a file was deleted, causing database corruption.
Re: One of the first lessons I learned... (Score:2)
That's no excuse. I worked with a system that dated back to the era of dBase. Designed by Very Smart People (not CS majors, flight control engineers). The maintenance process was such that every component had a revision number as a part of its name. And nothing was ever deleted or overwritten. The update process (a series of well-tested scripts) simply generated the next revision and stored it.
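A rough modern rendering of that "next revision, never overwrite" rule, sketched in Python. This is an illustration under my own assumptions, not the scripts that system used: the `.rNNNN` suffix and the names `latest_revision` and `store_next_revision` are made up.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: <name>.r<revision>, e.g. notices.db.r0003
_REV = re.compile(r"^(?P<name>.+)\.r(?P<rev>\d+)$")


def latest_revision(directory: Path, name: str) -> int:
    """Return the highest revision number stored for `name`, or 0 if none."""
    revs = [
        int(m.group("rev"))
        for p in directory.iterdir()
        if (m := _REV.match(p.name)) and m.group("name") == name
    ]
    return max(revs, default=0)


def store_next_revision(directory: Path, name: str, data: bytes) -> Path:
    """Write `data` as the next revision; existing files are never touched."""
    target = directory / f"{name}.r{latest_revision(directory, name) + 1:04d}"
    # Mode "xb" raises FileExistsError instead of overwriting an existing file.
    with open(target, "xb") as fh:
        fh.write(data)
    return target
```

Because the update path has no delete or overwrite call at all, a botched run can at worst add a bad revision, and the previous one is still sitting there to go back to.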
Re: (Score:2)
And, developers should never have permission to directly update production servers. They should automate deployments and test the deployment on a test server, and be able to update production ONLY through the deployment automation.
Re: (Score:2)
but devops!
Re: (Score:2)
Devops is not mutually exclusive with limiting developer access to production. In fact, DevOps helps _enable_ that limitation of direct access.
Yeah, I know. There are developers who think they MUST have that unlimited access to production. Those are not the mature developers you want in a highly sensitive system.
Re: (Score:2)
I don't want any access to production. Don't need the liability, nor the hassle, nor fingers being pointed at me if something goes wrong with a production system.
But, since bugs are often data-dependent, it is often quite necessary to have access to production data or something very much like it.
Sanitized, anonymized, etc. as needed of course.
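One common way to do the "sanitized" part is to swap sensitive fields for keyed tokens before the data leaves production. A minimal Python sketch, assuming rows are plain dicts; the names `pseudonymize` and `sanitize_row` are hypothetical. Strictly speaking this is pseudonymization, which is weaker than true anonymization.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Replace a sensitive value with a stable, keyed 16-hex-character token."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def sanitize_row(row: dict, sensitive: set, secret: bytes) -> dict:
    """Return a copy of `row` with the sensitive fields pseudonymized."""
    return {
        key: pseudonymize(str(val), secret) if key in sensitive else val
        for key, val in row.items()
    }
```

The same input always maps to the same token, so joins and data-dependent bugs still reproduce in the test copy, while the secret key stays on the production side.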
Re: (Score:2)
Fully agreed.
Re: (Score:2)
they'll get prod access when they prove to me they can actually troubleshoot a modern, distributed and scalable, microservice-based application.
Re: (Score:2)
Why would any good developer want (write) prod access? Sure, read access might be needed to troubleshoot data-dependent issues. But every good developer I know, wants nothing to do with directly manipulating production. Touching production directly is just asking for trouble!
Re: (Score:2)
Tell us you have no idea what devops is, without telling us you have no idea what devops is.
The whole point of devops is to limit access to production systems for the people writing the code, through automating testing and infrastructure.
Re: (Score:2)
Tell us you haven't seen real world devops without telling us your head isn't where it might be
Re: (Score:2)
Aww, did the snowflake get their feelings hurt?
Re: (Score:2)
Naaa.... Just sick to death of "kewl new" crap that creates more problems than it solves in the long run.
It all looks one hell of a lot like what we used to call "job security"
Re: (Score:2)
Funny story:
Years ago, I worked on a small system for the Feds. When I inherited the system, it generated 2 or 3 problem tickets a day.
The client told me he didn't want to continue the cycle of breaking what was just fixed with every new release. So long as I did that, we'd get along fine. I corrected what I inherited as well as implementing new requirements. The system did not generate a trouble ticket for the next 5 years. Life was good.
Then a db admin accidentally deleted a critical table for the system
My GOP Rep has been railing on this. (Score:2)
If only they privatize the already private IT Contractor. That will fix the problem!
Yeah, rebuttal will be fun.
Re: (Score:2)
Sounds like your district elected an idiot. That's unfortunate.
More unfortunate: seems lots of districts are doing that to some degree these days, so we aggregate idiots with power in Washington, and then complain when idiocy infects all levels of government.
Totally believable (Score:2)
Deployment automation (Score:3)
Many dev shops see deployment automation as a "nice to have." It's expensive in terms of setup and configuration time, and there's a temptation for developers to think "It's working on my machine, I'll just copy the files to the server and be done."
Deployment automation requires a higher level of skill than a lot of "ordinary" software development. It's risky, and can be affected by unexpected environmental factors. Because of these risks, it's essential to implement automated deployments, and to test the deployment on a test server before deploying to production. This is especially true for a system where down time is catastrophic, such as leading to an aviation ground stop.
Sure, sometimes even automated deployments fail, and in some cases a successful test deployment doesn't lead to a successful production deployment. But when this happens, and the cause is identified, the deployment can be updated to ensure the issue never happens again.
Was this incident due to a botched deployment? Who knows, the article doesn't get into that kind of detail. But if a contractor "unintentionally deleted computer files used in a pilot-alert system" that certainly seems to indicate a manual deployment was going on.
Re: Deployment automation (Score:2)
Many dev shops see deployment automation as a "nice to have." It's expensive in terms of setup and configuration time, and there's a temptation for developers to think "It's working on my machine, I'll just copy the files to the server and be done."
The way to get that implemented (and done well) is to do the development in-house. And put the people doing the development on the hook for maintaining the end-to-end process. Not just the developed application. If you are the one who supports the web page the production line uses to download engineering data, and you are also the one who has to hand-deliver drawings in the middle of the night should that web page fail, you will get it right.
Re: (Score:2)
Deployment automation requires a higher level of skill than a lot of "ordinary" software development.
What does ordinary software development have to do with anything? Have you ever deployed something on an ancient mainframe application? You think this is like kubernetes with a helm chart?
Imagine you have an old Unix system. You don't have Git. You don't have scp. You don't have a staging environment. You don't have unit tests or other automated tests. Documentation is incomplete and sparse. The system is 5 million lines of COBOL. What exactly is your plan here to set up automatic deployment in a way that w
Re: (Score:3)
Have you ever deployed something on an ancient mainframe application?
Yes, actually I have. I started my career in 1988, on a mainframe. Kubernetes or Docker is not required for automated deployments or deployment testing. Scripts _are_ required, and every system, all the way back to the mainframe days, supported scripting.
Imagine you have an old Unix system. You don't have Git. You don't have scp. You don't have a staging environment. You don't have unit tests or other automated tests. Documentation is incomplete and sparse. The system is 5 million lines of COBOL. What exactly is your plan here to set up automatic deployment in a way that would prevent outages that could not be prevented with a manual deploy?
I've lived this, but it's no excuse. An environment like this means that the system has been neglected for years, maybe decades. Yes, it's going to be problematic. But it's not impossible.
That Unix system may not have bash, but it does have sh or ksh or csh
Re: (Score:2)
Another really good piece of the puzzle: A/B deployment, canary deployment.
If you've tested, built, and are ready for deployment, make sure it actually deploys and responds with some kind of health check before taking the existing code offline. And if it's not working properly, revert back to the known-good state.
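Stripped of any particular tooling, the control flow is small. Here is a hedged Python sketch; `start`, `is_healthy`, and `stop` are placeholder callbacks I made up, standing in for whatever a given platform uses to launch, probe, and retire a version.

```python
from typing import Callable


def deploy_with_health_check(
    current: str,
    candidate: str,
    start: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stop: Callable[[str], None],
) -> str:
    """Start `candidate` next to `current`; switch only if it passes its check.

    Returns the version that should receive traffic afterwards.
    """
    start(candidate)
    if is_healthy(candidate):
        stop(current)  # retire the old version only after the new one is proven
        return candidate
    stop(candidate)  # roll back: the known-good version was never touched
    return current
```

The ordering is the whole trick: the known-good version is only stopped after the replacement has answered its health check, so a failed deploy costs nothing but the attempt.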
Hanlon's razor (Score:5, Insightful)
"never attribute to malice that which is adequately explained by stupidity."
Re: (Score:2)
Mod parent up!
Re: (Score:3)
Botched deployments happen. If you're a developer and have never screwed something up during deployment then you are either really new, really lucky, or not a real developer.
This smells like one of those "simple" changes that will only take a second to deploy and oops, oh sh*t, did we make a backup?
Re: (Score:2)
Agreed. It also smells like a manual, untested deployment, as they "unintentionally deleted files." If the deployment were automated and tested, it's unlikely that they would have unintentionally deleted key files that would lead to a catastrophic failure.
It happened in Canada and the US on the same day (Score:2)
This wasn't a "person" fail
It was a job shop given bad procedure working on both sides of the border
I wouldn't want to name them either
They would then drop the bomb on the folks who provided the process
Re: (Score:1)
it has to be, too much of a coincidence.
Lowest bidder IT... (Score:3)
Re: (Score:3)
One additional issue with some government work is that they rebid the maintenance every few years and then a new crew comes in and finds the code is a mess, there's no documentation, etc... The lack of continuity can lead to all sorts of problems.
I recently had first-hand experience with this, was contracted on a short consulting gig to help a software consulting firm help with a federal website (not FAA though). The current system has large amounts of code with single-letter variable names, and has layer
Re: (Score:2)
THIS is what happens when you bid out your IT operations to the lowest bidder.
The thing is, we've forced government departments into this with budget hawkishness.
So many eyes are on full time govt positions that it's hard to hire, doubly so for the wages that they're permitted to offer. So they hire contractors for twice the salary because that comes from an operating budget that isn't under as much scrutiny as public service remuneration.
The irony is, the unhealthy focus on public sector "waste" has created a situation where it's more wasteful because too many people are looki
Suspicious (Score:1)
I didn't (and still don't) buy this "explanation" -- it doesn't make sense, especially post 9/11. I suspect there is more to the story.
Southwest? (Score:2)
Does this same contractor also do work for Southwest Airlines?
2 points... (Score:2)
0. File permissions. Utilize them. Properly.
1. User privileges. Use them. Granularly. Properly.
Sheesh, this isn't rocket science. Oh...
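As a small illustration of point 0, a check like the following could run in an audit script. It is only a sketch: the helper name `is_too_permissive` is made up, and it covers classic POSIX mode bits only (no ACLs).

```python
import stat


def is_too_permissive(mode: int) -> bool:
    """True if group or others can write to a file with this mode."""
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))
```

For a real file the mode would come from `os.stat(path).st_mode`. Note that on POSIX systems deleting a file is governed by write permission on the containing directory, so the directory needs the same check as the files in it.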
Not some old "legacy" system (Score:2)
Many commenters are assuming that the NOTAM system is old software running on old platforms. Mentioning mainframes and COBOL.
It is more likely to be a fairly modern system, using Java and MySQL, running on Linux, and developed in the last 10 years or less.
The failure is fundamentally in the design of the system (the architecture and the procedures) which is why someone was doing something like sitting at an SQL command line and fucking up a daily, totally manual procedure to synchronize the two copies (live
Re: (Score:1)
"What does sudo rm -rf do, anyhow?"