Britain's Conservatives Scrub Speeches from the Internet
An anonymous reader writes with news of an attempt to erase a bit of history. From the article: "The Conservative Party have attempted to delete all their speeches and press releases online from the past 10 years, including one in which David Cameron promises to use the Internet to make politicians 'more accountable'. The Tory party have deleted the backlog of speeches from the main website and the Internet Archive (which aims to make a permanent record of websites and their content) between 2000 and May 2010."
Wrong (Score:4, Informative)
This is not accurate. Speeches made in Parliament are archived in Hansard for a start. And there is no changing that.
Re:Doesn't that kinda defeat the point of the arch (Score:5, Informative)
It's not even a takedown request. IA will honor robots.txt totally and retroactively - if they have 10-15 years of archived data at a specific domain (or subdirectory on that domain), and someone puts up a robots.txt disallowing them access, not only will they refuse to archive it going forward, but they will remove all previously archived material from being viewable (I hope they don't actively remove it from their archive, but merely stop making it available).
Re:And let's not forget why: (Score:4, Informative)
There have been more ideologically-oriented governments, from post-War Labour to Thatcher.
They might not keep all their promises, and all ideology is strongly diluted with practicality, but they're not the vacuous bunch of cunts we have in Britain today. (They're not that different from Blair, of course, but Blair had a more representative set of people to steer him.)
Re:Archive.org should not respect robots.txt (Score:2, Informative)
He misspoke. He meant to say they bought up domains and then used robots.txt to censor the site, including all older content.
Re:Doesn't that kinda defeat the point of the arch (Score:5, Informative)
I apologize for my mistake. Until just a few minutes ago, I was unaware that the Internet Archive agrees to RETROACTIVELY honor a robots.txt file. So once a robots.txt file restricts access to content, they voluntarily remove access to previously archived content from the archive. Here's the related item from their FAQ [archive.org]:
Some sites are not available because of robots.txt or other exclusions. What does that mean?
The Internet Archive follows the Oakland Archive Policy for Managing Removal Requests And Preserving Archival Integrity
The Standard for Robot Exclusion (SRE) is a means by which web site owners can instruct automated systems not to crawl their sites. Web site owners can specify files or directories that are disallowed from a crawl, and they can even create specific rules for different automated crawlers. All of this information is contained in a file called robots.txt. While robots.txt has been adopted as the universal standard for robot exclusion, compliance with robots.txt is strictly voluntary. In fact most web sites do not have a robots.txt file, and many web crawlers are not programmed to obey the instructions anyway. However, Alexa Internet, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If a web site owner decides he / she prefers not to have a web crawler visiting his / her files and sets up robots.txt on the site, the Alexa crawlers will stop visiting those files and will make unavailable all files previously gathered from that site. This means that sometimes, while using the Internet Archive Wayback Machine, you may find a site that is unavailable due to robots.txt (you will see a "robots.txt query exclusion error" message). Sometimes a web site owner will contact us directly and ask us to stop crawling or archiving a site, and we endeavor to comply with these requests. When you come across a "blocked site error" message, that means that a site owner has made such a request and it has been honored.
Currently there is no way to exclude only a portion of a site, or to exclude archiving a site for a particular time period only.
When a URL has been excluded at direct owner request from being archived, that exclusion is retroactive and permanent.
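To make the FAQ's description concrete, here is a minimal sketch of per-crawler robots.txt rules, checked with Python's standard urllib.robotparser. The robots.txt content and the example.com URLs are made up for illustration, and the ia_archiver token is my assumption about the user-agent name associated with the Alexa crawler; none of this comes from the FAQ itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (not from any real site): shuts out one
# crawler entirely and keeps all other crawlers out of /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(agent: str, url: str) -> bool:
    """Return True if the rules above permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed("ia_archiver", "http://example.com/speeches/2009.html"))
    print(allowed("SomeOtherBot", "http://example.com/speeches/2009.html"))
    print(allowed("SomeOtherBot", "http://example.com/private/notes.html"))
```

The parser only answers "may this agent fetch this URL right now?". Applying that answer retroactively to already-captured pages is a policy choice made by the archive, not something the file format expresses.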
Re:Deleted from the Internet Archive? (Score:5, Informative)
No, they put robots.txt on their website and the Internet Archive respects robots.txt retroactively. If they had 20 years' worth of data archived from one domain, and someone puts a robots.txt on the domain, all 20 years' worth of data is removed from the archive. Whether it's actually deleted or hidden is unknown, but I hope it isn't deleted.
Re:Archive.org should not respect robots.txt (Score:4, Informative)
I also have a link to a real-time predicted tide generator which takes about 30 seconds to calculate the information it sends back. Before I hacked in a robots.txt to cover it (it's on a different port than the normal web server and thus, according to the robot operators, a completely different website than the one that already had a robots.txt to stop them) one "helpful" robot indexer latched onto it and was sending ten requests per minute. Nice of them to throttle themselves, yeah, when they were running my Apache server up to the connection limit (keeping other people from using the site) and driving the load up so the site was useless for anyone local.
So any suggestion that any robot operator ignore robots.txt should be shouted down as the complete nonsense it is.
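The port problem described above follows from how crawlers locate the file: robots.txt is looked up per scheme, host and port, so a service on another port needs its own copy. A rough sketch of that lookup, using hypothetical example.org URLs and a made-up helper name (robots_url):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(url: str) -> str:
    """Return the robots.txt location that governs `url`.

    The scheme, host and port are kept; path, query and fragment are
    replaced, since the file always lives at the root of that origin.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Same host, different port: two separate robots.txt files.
    print(robots_url("http://example.org/index.html"))
    print(robots_url("http://example.org:8080/tides?station=1"))
```

The two calls yield different locations, which is why a robots.txt on the main web server did nothing for the tide generator.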
People have used robots.txt to buy up domains they want to censor.
You can't buy a domain with a robots.txt. Once you own the domain, you have the right to "censor" it all you want, including the use of a robots.txt that bars all robots. But if your goal was to "censor" a website, just stop running an HTTP server. That's much better than any robots.txt in keeping everyone from getting your stuff.
Re:Wrong (Score:3, Informative)
Sigh... Wrong in what way?
This was the archive of speeches: not just the parliamentary ones, but all the ones at election rallies and conferences too.
For instance, ToryBoy recently sat in a big gold chair and ate a four-course meal along with all his rich chums in the Guildhall, London. He then stood in front of a gilded podium and made a speech [theguardian.com] in which he told all the little people that they had not worked hard enough and that austerity is now here to stay.
This speech is exactly the sort of one that will never appear on Hansard, and in a few years may well be the sort of thing Tory spinsters will hope to make 'disappear'.
Only partially. (Also a wishlist.) (Score:5, Informative)
Indeed, it is ridiculous that the IA would retroactively remove stuff, though as you say, hopefully it just disables access instead.
I think the archive actually does just suppress access rather than purge the actual data, so they can again display it once copyright runs out (if it ever does...).
I also think the point is that newbies may not know about robots.txt and that even an experienced webmaster might accidentally allow access to something private long enough for it to get archived, or receive and honor a takedown notice, so this allows the correction of the error.
It's an 'archive' and should reflect how stuff 'was' at the time; legalities of that obviously being quite murky and hard to defend against expensive lawsuits, but still.
That's why. They have limited funds and need them to buy more disks and stuff, not fight lawsuits. If the choice is between not displaying some stuff and going broke and not displaying anything, the choice is also obvious.
I wish, though, that they were able to detect when a domain changed hands and not honor robots.txt requests retroactively past the boundary. IMHO a new owner is a new web site that happens to have the same name.
Especially: I wish domain name parking sites didn't put up robots.txt files that cause the archive to immediately purge/hide the previous owners' content. I've lost access to a lot of content from dead sites that way. (It also keeps the owners from rescuing their old content if they don't have personal backups.)
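The wishlist item above could be sketched roughly as follows. This is purely hypothetical: the Snapshot type, the visible_snapshots function and the owner_since date are my own illustrative names, the archive is not known to work this way, and reliably detecting when a domain changed hands is assumed rather than solved here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Snapshot:
    url: str
    captured: date


def visible_snapshots(
    snapshots: Iterable[Snapshot],
    robots_blocks_now: bool,
    owner_since: Optional[date],
) -> List[Snapshot]:
    """Decide which archived snapshots stay viewable.

    Hypothetical policy: a robots.txt exclusion hides only captures
    made on or after the date the current owner took over the domain.
    """
    if not robots_blocks_now:
        return list(snapshots)
    if owner_since is None:
        # No known change of ownership: hide everything, which is the
        # retroactive behaviour described in this thread.
        return []
    return [s for s in snapshots if s.captured < owner_since]


if __name__ == "__main__":
    caps = [
        Snapshot("http://example.net/", date(2004, 5, 1)),
        Snapshot("http://example.net/", date(2012, 3, 9)),
    ]
    for snap in visible_snapshots(caps, True, date(2010, 1, 1)):
        print(snap.captured.isoformat())
```

Under this policy a parking site's robots.txt would hide only the parking page captures, and the previous owner's content would stay viewable.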
Re:Archive.org should not respect robots.txt (Score:5, Informative)
As I understand it, Archive.org uses robots.txt to censor old, already captured data. That's a serious flaw in an archive IMO.
Re:And let's not forget why: (Score:5, Informative)
Here's a nice little summary of all those broken promises, pledges and outright deceit. [newstatesman.com]
It gets worse (Score:4, Informative)