Slashdot is powered by your submissions, so send in your scoop

British Library To Archive One Billion UK Websites 89

Posted by timothy on Sunday April 07, 2013 @04:39AM from the why-not-one-hundred-billion? dept.

An anonymous reader writes "The British Library is to begin archiving the entire UK web, including one billion pages from 4.8 million websites, blogs, forums and social media sites. The process will take five months, with the aim of presenting a more complete picture of news events for future generations to read and learn from."

This discussion has been archived. No new comments can be posted.

British Library To Archive One Billion UK Websites

Load All Comments

Search 89 Comments Log In/Create an Account

Comments Filter:

- Presumably (Score:4, Insightful)
  
  by AliasMarlowe ( 1042386 ) writes: on Sunday April 07, 2013 @05:19AM (#43383143) Journal
  
  Perhaps they mean one billion web pages rather than web sites. It seems unlikely that the UK could host a billion web sites (even the American billion of 10^9 rather than the British billion of 10^12).
  
  Parent Share
  twitter facebook
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.
    - Re:Presumably (Score:4, Informative)
      
      by Trpajzlix ( 2747079 ) writes: on Sunday April 07, 2013 @06:28AM (#43383283)
      
      Ehm, "everyone else". In Czech bilion = 10^12.
      The Brits use the same billion=10^9 as everyone else speaking english
      FTFY
      
      Parent Share
      twitter facebook
      - Re: (Score:1)
        
        by Anonymous Coward writes:
        
        I confirm... at least in Portugal and France, 1 billion is 10^12, rather than just 10^9.
      - Re: (Score:3, Insightful)
        
        by Alain Williams ( 2972 ) writes:
        
        Because of the ambiguity I usually say either ''a thousand million'' or use the SI prefix Giga. So: it will be an archive of a Giga web page. Hmmm: doesn't quite trip off the tongue, unfortunately.
        Similarly with dates. What does 10/5/13 mean ? 10 May 2013 or 5 October 2013 ? I favour the first (to know why see how I spelled 'favour'), but recognising that it can be misunderstood (by those who spell differently), I would usually write dates as 10 May 2013 - no ambiguity.
        
        Re:Presumably (Score:4, Insightful)
        
        by Tastecicles ( 1153671 ) writes: on Sunday April 07, 2013 @09:55AM (#43383813)
        
        I use YYYY/MM/DD. By extension, HH:MM:SS. Logical.
        
        Parent Share
        twitter facebook
        
        Re: (Score:2)
        
        by K. S. Kyosuke ( 729550 ) writes:
        
        Just write 20130407-171547 like everybody else and be done with it.
        
        Re: (Score:2)
        
        by AmiMoJo ( 196126 ) * writes:
        
        We had to send some drawings to some guys in the US a while back. Rush job just to check everything would fit. First they complained that we has post-dated it, then that our dimensions were impossible small until it was pointed out that "mills" means "millimetres" and not 1/1000th of an inch (which surely should be 1/1200th, unless it was an attempt at metrification).
      - Re: (Score:2)
        
        by Livius ( 318358 ) writes:
        
        Not only that, people not speaking English use a word with a different pronunciation! And spelling! And grammatical rules!
      - Re: (Score:2)
        
        by CanEHdian ( 1098955 ) writes:
        
        This is because the English didn't use the -illion and -illiard system, just kept -illion
        Rest of Europe: million, milliard, billion, billiard, trillion, trilliard
        England: million, billion, trillion
        This is something that cannot be "fixed" other than adopting the SI system.
      - Re: (Score:2)
        
        by tehcyder ( 746570 ) writes:
        
        This website is written in English. So, for example you would see 3.1415927 here rather than 3,1415927. It is silly to quibble about how conventions are different in other languages/cultures. I wouldn't go to a Russian language website and start moaning about how the alphabet is all fucked up.
    - Re: (Score:3)
      
      by Joce640k ( 829181 ) writes:
      
      Spain uses 10^12
    - Re: (Score:3)
      
      by Carewolf ( 581105 ) writes:
      
      The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.
      No a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now uses the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.
      - Re: (Score:2)
        
        by Tim the Gecko ( 745081 ) writes:
        
        The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.
        No a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now uses the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.
        I think you are a little out of date:
        
        The Times counting people [thetimes.co.uk]
        New Scientist counting years [newscientist.com]
        The Daily Telegraph counting YouTube viewers [telegraph.co.uk]
        The Economist Pocket Style Book recommended 10^9 for "billion" back in 1986.
      - Re: (Score:2)
        
        by tehcyder ( 746570 ) writes:
        
        The "British billion = 10^12" went out of use in the 1970's. The Brits use the same billion=10^9 as everyone else.
        No a billion is still 10^12. That has never changed. But because Americans usually get it wrong, the British now uses the American billion when speaking about money, but the real billion when speaking about everything else. Of course billions are rarely used for anything other than money.
        No one in Britain uses billion to mean 10^12 unless they are being deliberately anachronistic, and have no interest in communicating with other people. In the UK, you would say the world population was 7 billion, for instance.
  - Re: (Score:2)
    
    by tehcyder ( 746570 ) writes:
    
    Yes, it's not like the fucking summary says it's one billion web pages or anything, is it? Oh, wait...
- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  The trolls here are just getting weirder.
archive.org? (Score:5, Interesting)

by denpun ( 1607487 ) writes: on Sunday April 07, 2013 @04:58AM (#43383075)

Why not work with the good folks at archive.org and their Internet wayback machine [archive.org]?
Is it not a similar idea?
The Internet Wayback Machine folks could use the funding and would be achieving the same purpose, albeit not in a format that the library folks might want....but they could come to agreement.

Share
twitter facebook
- Re: (Score:1)
  
  by denpun ( 1607487 ) writes:
  
  Was not able to access the article linked btw. (or parent site for that matter). /.ed already?
  - Re: (Score:2)
    
    by Shimbo ( 100005 ) writes:
    
    Report from BBC news: http://www.bbc.co.uk/news/entertainment-arts-22028738 [bbc.co.uk]
- Re:archive.org? (Score:5, Insightful)
  
  by kaiidth ( 104315 ) writes: on Sunday April 07, 2013 @05:57AM (#43383225)
  
  Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others. Part of that is because funding doesn't always work that way. You can get money for claiming that you are going to do the very first über-awesome UK archive, but your chances of receiving the funding becomes rather lower if in the very first breath you point out that somebody else has been doing pretty much this for a decade. Another part of it is: most politicians would likely want the national heritage, such as it is (jubilee celebration tweets - please...) to be held by that nation's own national library.
  I would imagine the BL have referenced archive.org work extensively, but differentiate this project with what tits in suits like to call "a compelling USP." To put it in plain English, they'll have a neat explanation that suggests that they are totally aware of previous work in the domain whilst making sure that this project looks a) different, b) excitingly new and c) contextually, better.
  
  Parent Share
  twitter facebook
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others.
    
    Where 'others' also includes people who might wish to make use of the library, but are refused admission despite a research case. Whereas all UK undergraduates are automatically granted access.
  - Re: (Score:2)
    
    by 93 Escort Wagon ( 326346 ) writes:
    
    Without wishing to offend it, the BL is a monolithic organisation that doesn't always play well with others.
    And you REALLY don't want to piss off their Rare Book Retrieval Unit!
  - Re: (Score:3)
    
    by ibwolf ( 126465 ) writes:
    
    I would imagine the BL have referenced archive.org work extensively
    They've actually worked closely with the Internet Archive for many many years. This includes commissioning IA to conduct crawls for them of government sites.
    Both the BL and IA are members of the International Internet Preservation Consortium (IIPC see: http://netpreserve.org./ [netpreserve.org.] Both are very familiar with what the other is doing in this space.
    So why not let IA do all the work? There are several reasons. Part of it is that the BL is responsible for web archiving as far as British cultural heritage is concerne
    - Re: (Score:1)
      
      by kaiidth ( 104315 ) writes:
      
      See, what you're saying is both sensible and unsurprising, but here's what bothers me: TFA doesn't acknowledge any of what you are saying. Instead, it suggests this is a novel activity, which seems ridiculous but happens for political reasons.
- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  The British Library will probably use the same techniques as internet archive.
  
  Some reasons:
  * internet archive may bankrupt and the material may be lost. Government libraries may have - in theory at least - more reliable funding to preserve the material.
  * it is easier to do targeted crawling (of specific themes) using your own workers than 3rd party company
  * there are some legal matters that may make it more "illegal" for the 3rd party to do the crawling than if the government organization does it (as specif
- Re: (Score:2)
  
  by tehcyder ( 746570 ) writes:
  
  Why not work with the good folks at archive.org and their Internet wayback machine [archive.org]?
  Is it not a similar idea?
  The Internet Wayback Machine folks could use the funding and would be achieving the same purpose, albeit not in a format that the library folks might want....but they could come to agreement.
  This is specifically for UK web sites, and the British Library is a British institution funded by the British taxpayer. Archive.org is US-based and a separate entity.
Gotta love that management "thought" process (Score:5, Funny)

by 93 Escort Wagon ( 326346 ) writes: on Sunday April 07, 2013 @05:03AM (#43383085)

We had a manager, some years ago, who had the bright idea of assigning one staff member the task of printing out our entire website once a month so she (the manager) could look things up easily.

Share
twitter facebook
Data Storage (Score:2)

by Trpajzlix ( 2747079 ) writes:

How are they going to store the data? Isn`t this whole library idea about storing things for future generations if there has been a war or other mass scale destruction? So when "future generations" uncover this Babylonian/British collection of knowledge hundreds years later, they can still learn from the remains? What are they going to get from a 200 years old harddrive, covered in dust?
- Re:Data Storage (Score:5, Funny)
  
  by 93 Escort Wagon ( 326346 ) writes: on Sunday April 07, 2013 @05:08AM (#43383105)
  
  How are they going to store the data?
  They're planning to save disk space by just referencing the original page content inside of an iframe.
  
  Parent Share
  twitter facebook
- Re:Data Storage (Score:4, Informative)
  
  by Anonymous Coward writes: on Sunday April 07, 2013 @05:26AM (#43383159)
  
  BL, and other memory institutions such as archives, apply a concept, called "Digital Preservation", to the stored data. This concept, based on the OAIS model, covers all stages of storage, administration, maintenance and retrieval of these "remains".
  Hardest part of webarchiving is not storing the data but how to render it in 200 years. They also need to store the browser, but nowadays, browsers use so much different "subrenderers" such as Flash, Java, Javascript and CSS engines and whatnot to render a page, so there is also a need to archive all those subrenderers as well.
  Best known strategy to date is to create and store emulator containers or VM's with the original software so they can be emulated in the far future.
  http://en.wikipedia.org/wiki/Open_Archival_Information_System [wikipedia.org]
  
  Parent Share
  twitter facebook
- Re: (Score:3)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re:Data Storage (Score:4, Funny)
    
    by N Monkey ( 313423 ) writes: on Sunday April 07, 2013 @06:47AM (#43383321)
    
    How are they going to store the data?
    They'll use the "Cloud".
    ..., Oh, wait...
    No problems. Plenty of those in the UK.
    
    Parent Share
    twitter facebook
- Re: (Score:2)
  
  by loufoque ( 1400831 ) writes:
  
  Hasn't each person created at least 10 websites in their lives?
You can't just do it once... (Score:5, Interesting)

by icebike ( 68054 ) writes: on Sunday April 07, 2013 @05:24AM (#43383157)

Unless you do this fairly frequently, say every 6 months at a minimum, the picture left for future generations will be muddled at best.
Its always interesting how the news changes with the passage of time, and events are seen very differently in just a few weeks.
On 9/11 I used this Adobe's web site mining software that essentially captures every link on every page of a site and builds a large web replicate in pdf form. All the links work within that PDF, and every page on the the site is preserved. I pointed it at all the major news web sites, one large PDF for each, burned them to disk, and still have them today. (Yup, I violated a boat load of copyrights).
Two weeks later I did it again. You would be astounded at the difference. Entire pages are missing, not just unlinked, but even when you look for them by URL that appeared in the first capture, you won't find them in the second. Other news sites kept the old stuff on line, but the links often disappeared from their own web pages so that the only way to find these pages was by following links from some other site.
The point is, that a snapshot of the web does very little good, unless it has some collection. Looking at the archives of a newspaper from June 6 1944, wouldn't give you much of an idea of the Normandy invasion, unless you had subsequent editions from days and months forward.
But a web site isn't a newspaper with discrete editions, it is a constantly evolving thing, and archiving it today (or any point in time) is fairly useless, but archiving it daily is largely redundant, (most stories will be the same). You can't tell which stories changed over time based solely on the dates either, so you pretty well have to grab it all.
Why doesn't the Library simply work a deal with the Wayback Machine Internet Archive [archive.org]. They seem to have this problem fairly well thought out. Maybe they plan to do that. I can't tell because the site that wants to archive all of Britain seems slashdotted at the moment.
It seems that libraries are about the only place that can get away with ignoring copyright these days.

Share
twitter facebook
- Re: (Score:3)
  
  by El_Muerte_TDS ( 592157 ) writes:
  
  > (Yup, I violated a boat load of copyrights).
  So, you distributed the created PDFs? If you didn't, and it's still your in private collection, when how did you violate the right of creating copies?
- Re: (Score:2)
  
  by bumburumbi ( 1047864 ) writes:
  
  The National Library of Iceland has had a similar program for a couple of years. The national TLD is collected three times a year and made available via the Wayback Machine [archive.org]. The english version of the project's page [vefsafn.is] is rather terse, but according to the Icelandic version, selected pages are collected more frequently when warranted, e.g. political debates around election times. Icelandic law requires publishers to deposit copies of ther work with the National Library. This includes web pages so the libra
- - Re: (Score:2)
    
    by Tastecicles ( 1153671 ) writes:
    
    I use Backstreet. OK it's £13 after the 30-day trial, but it's bloody handy to have a full relinking of crawled content so you can pretty much pull a website, import it into a VM, and do what you want to do there. Me? I PDF what I download using Acrobat X batch conversion then run the fulltext indexing engine. Considering it's all running on a VM it ain't half fast, even if it is currently holding an index of 6 million pages.
    Oh yeah, and it runs on Linux via WinE. Not that I run it on Linux, I run it
- Re: (Score:2)
  
  by dkf ( 304284 ) writes:
  
  Why doesn't the Library simply work a deal with the Wayback Machine Internet Archive [archive.org]. They seem to have this problem fairly well thought out. Maybe they plan to do that. I can't tell because the site that wants to archive all of Britain seems slashdotted at the moment.
  I imagine that it will eventually happen, and that it will end up enriching the archive.org system when it does. Maybe it won't happen for a year or two, but when we're talking about long term preservation, that's not so important and the global nature of the internet makes it valuable (and logical) to globally coordinate the historical archives of it as well.
  It seems that libraries are about the only place that can get away with ignoring copyright these days.
  National libraries cannot ignore copyright, but they have a special position with regards to copyright law: they're explicitly empowered to retain cop
The process will take five months (Score:2)

by hcs_$reboot ( 1536101 ) writes:

They should definitely reduce the time allotted to that tea break..
- Re: (Score:3)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re: (Score:2)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
I'll see your Internet Archive and raise you... (Score:2)

by Tastecicles ( 1153671 ) writes:

...typically British utter redundancy.
- - Re: (Score:2)
    
    by Tastecicles ( 1153671 ) writes:
    
    my point is (and I apologise if I didn't make it obvious) that this isn't news. IA has archived the internet, and done a fairly decent job of it. The BL is off on a "Me Too!" campaign and the BBC are all over it like it's a first.
    - Re: (Score:2)
      
      by tehcyder ( 746570 ) writes:
      
      There is still no harm in a national archiving organisation doing its job for its own country's data.
- Re: (Score:2)
  
  by tehcyder ( 746570 ) writes:
  
  ...typically British utter redundancy.
  Yeah, we're the sort of idiots who make more than one back up of important data. What's the point of that eh?
  Hint: redundancy is somethimes a very, very good thing indeed.
Wow (Score:2)

by databeam ( 867515 ) writes:

That's going to be a lot of porn!
Illegal Content (Score:2)

by wisnoskij ( 1206448 ) writes:

So will they being getting legal permission to host all of this copyrighted material.
Doesn't all the individual websites won their own content, how does archive.org even get around this?
And what about the illegal porn, cracks, hacks, and viruses?
- Re: (Score:2)
  
  by PPH ( 736903 ) writes:
  
  And what about the Elgin Marbles?
Average Web Site (Score:3)

by wisnoskij ( 1206448 ) writes: on Sunday April 07, 2013 @09:33AM (#43383723) Homepage

So the average website contains about 1 thousand pages then? That seems like a lot...

Share
twitter facebook
- Re: (Score:2)
  
  by tehcyder ( 746570 ) writes:
  
  So the average website contains about 1 thousand pages then? That seems like a lot...
  No, it doesn't. Imagine how many pages something like the BBC website has on any particular day.
  - Re: (Score:2)
    
    by wisnoskij ( 1206448 ) writes:
    
    Yes, but you would be hard pressed in my opinion to fund more than a few hundred regular websites that contain around or more than 1000 pages. Add in every medium or larger sized forum and it really seems like 1000 is a lot. I think the mode (type of average) website would have something like 10, with a bunch more at the 50 range, and still quite a bit at a few hundred. But I really do not see many websites that have over 1000.
    I guess news sites that keep every article they ever published in the last 100 ye
Assumptions and questions (Score:3)

by Martin S. ( 98249 ) writes: on Sunday April 07, 2013 @12:30PM (#43384567) Journal

There seem to be a few post making incorrect assumption and raising questions. I was involved as a technical architect on the long term preservation store aspect of this project few years ago.
archive.org The BL is already cooperating with a number of other organisations do the same thing thing, including the archive.org, the Smithsonian, Scottish, French, Australian, Canadian and quite few other National Libraries. archive.org has been an important technology spike for these but is not the whole solution.
Preservation BL has a legal responsibility to preserve it's archive, including this content essentially forever; which is a significant technology challenge.
Legal archive.org is essentially opt in; the BL programme is legal deposit requirement. The site content for any uk tld should be collected at least once a year. An important piece of the technology puzzle is to identify these and mange this process.
Scale The last scaling I saw placed the BL archive about two orders of magnitude larger than archive.org and growing faster. The number of new websites in .uk grows faster than the awareness of archive.org. There are a lot of challenges
- Maintain structure and semantic context.
- Searchable Meta Data
- Searchable Content
- Re-Presentation

Share
twitter facebook
French National Library does it since 2006 (Score:1)

by aikawa ( 776347 ) writes:

The BnF (French National Library) has started doing this in 2006 for a selection of .fr websites.
In 2011 they had 16.5*10^9 files.
They store content on "Petaboxes" made by the Internet Archive.
See http://www.bnf.fr/en/collections_and_services/book_press_media/a.internet_archives.html [www.bnf.fr]
Clarification for posterity (Score:2)

by illtud ( 115152 ) writes:

I'm pretty late to this story, but let me clear up some misunderstandings for posterity's sake:
Disclosure: I've been involved in this effort for at least ten years, I'm head of ICT for one of the UK Copyright Libraries (National Library of Wales), and this story goes way back to the Primary Legislation passed by the UK in 2003, and we've been working on the practicalities of this since before that legislation was passed.
* Yes, Internet Archive and others have been archiving web sites for many years. We're u

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Presumably (Score:4, Insightful)

Re: (Score:1)

Re:Presumably (Score:4, Informative)

Re: (Score:1)

Re: (Score:3, Insightful)

Re:Presumably (Score:4, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:1)

archive.org? (Score:5, Interesting)

Re: (Score:1)

Re: (Score:2)

Re:archive.org? (Score:5, Insightful)

Re: (Score:1)

Re: (Score:2)

Re: (Score:3)

Re: (Score:1)

Re: (Score:1)

Re: (Score:2)

Gotta love that management "thought" process (Score:5, Funny)

Data Storage (Score:2)

Re:Data Storage (Score:5, Funny)

Re:Data Storage (Score:4, Informative)

Re: (Score:3)

Re:Data Storage (Score:4, Funny)

Re: (Score:2)

You can't just do it once... (Score:5, Interesting)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

The process will take five months (Score:2)

Re: (Score:3)

Re: (Score:2)

I'll see your Internet Archive and raise you... (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Wow (Score:2)

Illegal Content (Score:2)

Re: (Score:2)

Average Web Site (Score:3)

Re: (Score:2)

Re: (Score:2)

Assumptions and questions (Score:3)

French National Library does it since 2006 (Score:1)

Clarification for posterity (Score:2)

Related Links Top of the: day, week, month.

Slashdot Top Deals