Distributed Proofreaders Posts 5,000th E-book
bbc writes "Distributed Proofreaders has posted its 5,000th ebook to Project Gutenberg. The book, a Short Biographical Dictionary of English Literature, by John W. Cousin, was proofed for this special occasion by over 500 volunteers.
Distributed Proofreaders is a project that distributes the otherwise gargantuan task of correcting scanning and recognition errors in an OCR'ed text. The project has thousands of volunteers, of whom many hundreds are active on any given day. It is currently the main supplier of etexts for Project Gutenberg."
Exxcelent Werk (Score:5, Funny)
Re:Exxcelent Werk (Score:5, Funny)
Re:Exxcelent Werk (Score:1)
Shocking (Score:2, Funny)
Re:Shocking (Score:1)
1390/484000 ~= 0.003
Re:Shocking (Score:2, Interesting)
So.... (Score:5, Funny)
Wonderful (Score:5, Informative)
Re:Wonderful (Score:3, Informative)
Hm! (Score:5, Interesting)
Re:Hm! (Score:3, Interesting)
Re:Hm! (Score:5, Informative)
Also, we are very comfortable with being a provider of *public domain* material, and I think many members wouldn't feel comfortable moving into the copy-restricted domain.
Re:Hm! (Score:2, Interesting)
Re:Hm! (Score:3, Informative)
Re:Hm! (Score:1, Insightful)
I need a new job (Score:5, Funny)
Re:I need a new job (Score:1, Funny)
Re:I need a new job (Score:5, Interesting)
Slow down! (Score:4, Funny)
Other than this I just found, the other 4985 are AOK so far.
Good work guys. Free the books. ook.
(re-reading Sourcery on the commute today... ook oook)
5052 (Score:2, Informative)
500 people read it? (Score:4, Interesting)
Hardly a non-put-downable... I suppose that it being a Biography (shouldn't that be bibliography? *chuckle*) of English literature is kinda symbolic.
I guess this more than doubles the total number of people who have read this book though!
I like Gutenberg. I hope they start a system where you can download copyrighted books for a micropayment; I would pay good money for text ebooks.
Let's hope ebooks don't go the way of music: keep the costs low, no DRM fluffing up the download. If you can click 3 times and start reading a new book, and it costs you euros, then you would prefer that to d/l'ing gigs of warez.
Anyone who illegally downloads lots of books tends to be the person who doesn't read them much anyway (Someone boasted to me that they had 300 O'Reilly books, squirming under the desire to tell me that they were eBooks, off irc, oh lawks, what a riot, I wish I was your friend, go away)
Re:500 people read it? (Score:5, Insightful)
Rather than setting up a complicated system to make micro-payments that only some people would follow anyway, do what I do: determine a fair value for yourself and make a donation. Not for one book, but estimate a year or two's worth so you don't 'nickel and dime' the value of your donation with transaction fees.
Re:500 people read it? (Score:3, Interesting)
Who cares what publishers think, they are wondering how they can be a middle man in a digital age. We will start with good bi-format books, all available in eBook, all 100% well formatted. Then some will move more over into eBooks.
Then every internet who [xmission.com]
Re:500 people read it? (Score:2)
Is it possible... (Score:4, Funny)
Instead of 'What light through yonder windows breaks?' we get 'Who is that hot chick I can see through my binoculars?'
How strange (Score:2)
Re:How strange (Score:4, Informative)
Re:How strange (Score:2, Interesting)
I got a bit carried away. This 5000th project was organized so that as many proofreaders as possible would work on it. (Although any book going through DP runs a chance of being proofread by many separate people, usually proofreaders stick with a certain book for a while, so that the work has only been seen by 50 or so.) I was so glad we pulled it off that I sent a story to Slashdot without thinking.
Re:How strange (Score:1)
JHutch
Re:How strange (Score:1)
But you have to bring the beer for the real thing.
A shame (Score:5, Insightful)
I just don't understand the point of retroactive copyright extensions. The idea behind copyrights, like patents, is to encourage innovation by allowing the creator an exclusive right for a limited time. If people believe copyright terms need to be extended to achieve this goal, fine. I disagree, but whatever. However, I think it's ludicrous that terms should be extended on works that have already been created, unless maybe they think that extending terms retroactively will lead to more works being produced in the past?
Re:A shame (Score:5, Insightful)
There's nothing to understand. Everything's about money now. Nobody cares about books, art or people. If you can make money - especially on the work of authors usually living near poverty - long after they are dead, then you are the winner of this big capitalistic orgy!
Re:A shame (Score:2, Insightful)
You could fill a music hall with people and pay the performers. You want to open another music hall? You need another set of performers.
Recorded music meant that each copy scaled the initial costs down. This has, over time become even more exaggerated, though. At one time, record product
To paraphrase... (Score:1)
Re:A shame (Score:2)
PS: Just don't take the world with it.
Re:A shame (Score:2)
A nice explanation of how this all works out can be found i
Make them renew each year (Score:4, Insightful)
Here's what I want to see:
You get automatic copyright for 25 years. After that, you must pay $1 per year to keep something in copyright. If you can't be bothered to keep track of your stuff and pay the $1, it lapses into the public domain.
Disney will pay the $1 for Mickey ($1 for Steamboat Willie, $1 for each other cartoon, $1 for each book, etc.). But forgotten gems, like ancient Apple ][ games, will become legal public domain items.
I'd actually like to see a hard limit of 50 years or so for copyright, but even if you can't get that, at least the above scheme makes a lot of stuff lapse into the public domain.
A cool feature: if the legal trail is tangled and murky, and no one knows who owns it anymore, no one will pay the $1 and it will fall into public domain. Let's say LSD Software wrote a fun game for the Commodore 64. Then ABC Games bought the game from LSD (who kept the rights to use the music in future games). Then ABC Games went under, but its assets were bought by PDQ Games, which later split into PDQ Software and Foo Bar Games. After that it gets REALLY complicated... anyway, after all that, who exactly owns that fun game? No one knows. It would take a court case to decide, but no one will bother so no one will ever know. Under the current system, you are technically a pirate if you keep the game, but there is no one you can pay a license fee to and legally have the game! Catch-22.
Heck, Disney should want this. They make big bucks by Disney-ifying public domain stuff, so they should make sure things will actually go into the public domain in the future.
Re:Make them renew each year (Score:4, Interesting)
As far as your scheme though, I would really like a hard extension limit and I think 25 years for a default term is really too much (I mean, to use your example of Apple II games, many of those games wouldn't even quite be out of term yet). I think 5 or 10 would be much better.
Re:Make them renew each year (Score:3, Interesting)
I would even go a bit further. Why even have a default term at all? (and 25 years is a LONG time) And $1 is arguably a bit little. If you really care, you can pay a bit more. Maybe we can even have different levels of protection - pay nothing if you allow modifications, pay more to retain excl
Re:Make them renew each year (Score:2)
But creating an intellectual property tax to be paid after a piece of IP turns 25 is, IMHO, a good idea. Take the example of the Beatles: if it wasn't for Disney and friends lobbying to have copyright extended then their work would already be public domain. But the Beatles' music still makes money, fair enough, while financially lucrative the copyright holders can afford to pay t
Re:Make them renew each year (Score:1)
It's not about the money, it's about the effort. Most people won't be willing to renew most works. As a result, these works become public domain (and verifiably so).
This creates several beneficial situations:
1. If you want to use a work that the author lost interest in, you can.
2. If you want to use a work that the author still is interested in, you now have a way to find out who the author is and how he can be contacted.
(When I say 'use' I mean 'use in a way that wou
Re:Make them renew each year (Score:1)
Very much so.
The fact that Big Copyright have declared themselves fierce opponents of any law that would reintroduce registration and renewal in the US has made some people remark that their ultimate motive is control.
$1 the first year, $2 the 2nd, (Score:2)
By the time the copyright got to 21 years it'd be over a million dollars to renew it, which would strongly encourage people to just let it go to the public domain. This way would also protect small time inventors/writers, since even at 7 years, it's only $64 to
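A quick sketch of that arithmetic, assuming the fee starts at $1 in year 1 and simply doubles every year after that. The function names are made up for illustration and are not part of any actual proposal.

```python
def renewal_fee(year: int) -> int:
    """Fee in dollars for a given year, assuming $1 in year 1 and doubling yearly."""
    if year < 1:
        raise ValueError("year must be >= 1")
    return 2 ** (year - 1)


def total_paid(years: int) -> int:
    """Cumulative fees for keeping the copyright through `years` years."""
    # Sum of 1 + 2 + 4 + ... which equals 2**years - 1.
    return sum(renewal_fee(y) for y in range(1, years + 1))


if __name__ == "__main__":
    for year in (7, 21):
        print(f"year {year}: fee ${renewal_fee(year):,}, total so far ${total_paid(year):,}")
```

Under these assumptions the year-7 fee is $64 and the year-21 fee is $1,048,576, which matches the figures quoted above.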
Re:A shame (Score:2)
Moreover, I submit that it is an unconstitutional ex post facto law. There is a reason the Constitution prohibited retroactive laws, but we seem to be ignoring that principle today.
Who picks this stuff? (Score:4, Interesting)
Still, I look forward to the day when someone starts digitizing the Mechanics Institute Library in San Francisco. It's a beautiful private library one can join. The books are in excellent condition, and there are century old original editions on the shelves.
But it's the magazine collection that's stunning. They have Popular Mechanics in bound volumes, all the way back to the beginning, when it was a serious scientific journal. All the major railroad magazines from the heyday of railroading. Every issue of Electric Railway Journal (the trade magazine of streetcars). Few other libraries kept that stuff.
Re:Who picks this stuff? (Score:1)
If you want to see geek heaven go look through the adverts...
I keep going to look in the hope that someone will put Olaf Stapledon or EE "Doc" Smith up but alas rights are a real bitch...
(1950's SciAm are pretty cool too - stuff about electroluminescence and (cough) computers).
Re:Who picks this stuff? (Score:3, Informative)
Until the middle of last year, we focused almost exclusively on books. Since then, we've been putting some very interesting periodicals through the site (Punch, The Strand Magazine, Scientific American, Notes & Queries, to name but a few). Magazines aimed specifically at boys (or, indeed, girls) would be a great addition to the pile!
Re:Who picks this stuff? (Score:1)
good books? (Score:1)
does anyone have suggestions for fiction titles on Gutenberg?
i need a good read, but i don't want to pay or find something good myself.
Re:good books? (Score:3, Informative)
Re:good books? (Score:2)
I download a load of texts, put them on my iPAQ, then use flite to do text->speech and read them out.
Btw, has anyone thought of marking up any of the books so they can be read better by something like festival? (emotions, sex of character etc)
Re:good books? (Score:3, Interesting)
That site has a couple of good ones. You should first read "The lost continent". The book was written shortly after, or during, WWI and follows a hypothetical development of the world if the new world and the old world had lost communication until 200 years later. The most interesting thing about those old science fiction books is to contrast their world view with ours and to see what futuristic devices would exist by now.
Cheers,
Adolfo
Re:good books? (Score:5, Informative)
Re:good books? (Score:2)
Seriously, the text was in pretty bad shape, with lots of common OCR errors: 1 = I, 5 = S, b = h, etc., chapter titles missing, etc. Does DP take on new versions of existing PG books? I'd volunteer to try and do a better job on Ulysses.
Re:good books? (Score:1)
Yes, but I don't know if there are any conditions attached.
Better would be to next time keep notes of all the errors you encounter, and send them to Project Gutenberg, where volunteers will use them to correct the book.
The Project Gutenberg FAQ tells you what to send where, and how.
Re:good books? (Score:4, Insightful)
Yes, we do -- although as I mention in an earlier post, we have a year's worth of material as it is, without going back and re-doing the older material already in PG. However, as you say, some of PG's content is below the standards we expect of newly produced text. Hopefully we can go back and correct *all* of PG's content over time. The main factor stopping us is that we need page scans of any project before it can go through DP. If you know of any page images of a clearable edition of Ulysses, or indeed if you have a clearable edition which you are willing to scan, then we would gladly put it through the site.
Re:good books? (Score:1)
Some of the famous literature that is in the public domain: Jules Verne, Sherlock Holmes, Frankenstein, War of the Worlds, Wuthering Heights, the Bible, anything Shakespeare, Aesop's fables, Mother Goose, Alice in Wonderland, Wizard of Oz, Ulysses (both Homer's and Joyce's versions), The Picture of Dorian Gray, Heart of Darkness, Treasure Island, The Jungle Books, et cetera, et cetera.
law of averages? (Score:2, Interesting)
However, I am curious as to just how accurate the proofreading is. I think that they try to improve accuracy by having many different volunteers; accuracy in numbers and all that. However, just because many people think in a certain way, does not mean that what they think is accurate. Just look at standardized tests. They are specifically designed to make
Re:law of averages? (Score:5, Informative)
The answer is: surprisingly accurate. We proof one page at a time, working from the original scanned images, and emphasise that people should try as hard as they can to stick to the source material. As counter-intuitive as it may appear, this type of proofreading is actually hardest to do with material from the late 18th/19th century -- subtle changes in spelling (and small changes in accent systems for the non-English languages) make errors much harder for human proofreaders to correct than the earlier material, where spelling consistency was completely optional!
Each page is OCRed (and the ability of modern OCR programs is a major improvement over those of even a couple of years ago), proofread twice, and then the whole document is reviewed twice before being posted. We've also recently become much more aware of the need to make useful texts which can be used for scholarly purposes in the future, leading to such improvements as retention of all page numbers.
Re:law of averages? (Score:5, Insightful)
At the risk of going over very old and well-trodden ground, if PG wanted to be useful for "scholarly purposes" it should long ago have corrected the original mistake of using plain text, and used a markup that could have kept page numbers and other meta-information for scholars, while giving the common reader a clean text with a suitable style sheet. But even today on the PG website is a "justification" [gutenberg.net] for sticking to plain text making it clear that scholars don't even figure in the intended audience for PG texts.
Re:law of averages? (Score:3, Informative)
For example, many of us make sure that we produce a valid XHTML edition of each project, and that the page numbers and edition information of the source are preserved. For an example text, see Graham Wallas -- Human Nature In Politics [gutenberg.net]. We are currently working on a markup
Re:Any chance images could be made available? (Score:3, Insightful)
It's possible that we might interface with something like the Million Book Project [archive.org], which makes page images, but no text, available.
Re:law of averages? (Score:2)
Personally I'm of the opinion that almost everything is better represented as plain text. In extreme cases, maybe plain text + italics, bold, and the ability to link in pictures.
I can understand other arguments, but in general, I think plain text is the most universal and common format - and thus best suited.
Maybe every
Re:law of averages? (Score:1)
The sort of scholar that would make such unqualified statements about the need for mark-up has no place in academia.
Project Gutenberg has excellent reasons to stay with plain text as the most basic distribution format, reasons that have proven themselves over time.
Smart scholars have many uses for plain Gutenberg texts.
Re:law of averages? (Score:1)
I agree, PG text format is not good for reproducing the features of a printed book. Much better to use something like TEI [tei-c.org]. However, marking up in a semantic format raises a hairy issue: the proofreader needs to interpret the meaning of textual elements (such as italics which are used for a foreign language term - that's different from italics used as emphasis). That requires more training than simple PG markup. And of course there is the issue of a decent user interface...
Having said this, maybe these pro
Re:law of averages? (Score:2)
The audience for PG is EVERYONE - every single person on the whole planet, reg
Re:law of averages? (Score:2)
Sorry, but you don't get it. (Score:2)
You could also store documentation for this format and even source code (in a variety of languages) for a program that converts the metadocument into straight text. Then you won't have to worry about converting each of them painfully or worry about outdated formats.
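To give a sense of how small such a converter can be, here is a minimal sketch in Python. It assumes an HTML-like markup, since the format is not named here, and `TextOnly` and `strip_markup` are made-up names. A real converter would also need to decide what to do with things like page numbers and footnotes.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: keeps character data, drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; entities are already decoded.
        self.parts.append(data)


def strip_markup(marked_up: str) -> str:
    """Return the plain text hidden inside an HTML-like document."""
    parser = TextOnly()
    parser.feed(marked_up)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_markup("<p>Call me <em>Ishmael</em> &amp; friends.</p>"))
```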
And even if the worst happens and the format becomes outdated and unreadable, the text is still there, hidden in markup. It wouldn't be that hard for s
Accuracy (Score:4, Interesting)
One of the books I worked on was the "Anatomy of Melancholy" and I (conveniently) have a copy myself. There were often more differences between the scanned image of the page and my copy than between the scanned image and the proofread text.
Don't underestimate the amount of work people put into this either - for "Anatomy of Melancholy" it often took 30 minutes to proof a single page because the page often had Latin and very small footnotes.
Re:law of averages? (Score:2, Informative)
That's very hard to tell, as there is no gold standard for accuracy. There are two sometimes conflicting goals in regards to accuracy that we have; one is to preserve the author's intent, the other to preserve the actual printed text. At some points these two conflict, for instance, when we would like to normalize spelling to increase readability.
There is currently some talk going on at the DP forums as to which system would be best to e
Rsync your own Gutenberg library (Score:5, Informative)
Just be aware that the Gutenberg archive is some 135GB, and much of it is gif, jpg and mp3 (spoken word books). So I just used --include in rsync to download the .txt, .htm and .html files. It's a more manageable 10GB download.
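For anyone wanting to try the same thing, here is a rough sketch of how the filter arguments might be assembled, written in Python so the ordering is easy to see. The mirror URL and destination below are placeholders (no mirror address is given here), and `build_rsync_command` is a made-up helper. The flags themselves (`-a`, `-v`, `--include`, `--exclude`, `--prune-empty-dirs`) are standard rsync options.

```python
import shlex


def build_rsync_command(source, dest, extensions=(".txt", ".htm", ".html")):
    """Build an rsync argument list that copies only files with the given extensions."""
    cmd = ["rsync", "-av", "--prune-empty-dirs"]
    # rsync applies filter rules in order, so directories must be included
    # first, otherwise the final catch-all exclude stops the recursion.
    cmd.append("--include=*/")
    cmd += [f"--include=*{ext}" for ext in extensions]
    cmd.append("--exclude=*")  # everything else (gif, jpg, mp3, ...) is skipped
    cmd += [source, dest]
    return cmd


if __name__ == "__main__":
    # Placeholder values: substitute a real mirror and a local directory.
    source = "rsync://mirror.example.org/gutenberg/"
    dest = "./gutenberg/"
    print(shlex.join(build_rsync_command(source, dest)))
```

The script only prints the command so it can be inspected (or run with `--dry-run` added) before committing to a multi-gigabyte download.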
Re:Rsync your own Gutenberg library (Score:3, Informative)
I hope... (Score:1)
Re:I hope... (Score:1)
If a volunteer feels comfortable with MS Word, then by all means they should try and commit a book in that format. The only demand Project Gutenberg makes, is that the etext is also submitted in 'plain vanilla text' format, so that anybody can read the text, anywhere and anytime.
formatting (Score:4, Interesting)
My only complaint is with the formatting. Project Gutenberg uses hard formatting within the text. I think that's an extremely stupid idea.
There should be zero formatting within the text (other than paragraph breaks). Whatever client you're using should provide the formatting for you.
Let the client handle the presentation!!
Re:formatting (Score:1)
Re:formatting (Score:2)
Re:formatting (Score:1)
Currently, there is at least one effort to come up with an XHTML-conformant standard (it has stalled somewhat due to summer volunteer burnout) and a TEI-lite-conformant standard. The problem is getting a standard simple enough for the average lay person to remember it well enough to actually mark up texts, while complex enough to handle 99% of the texts we see.
It ain't easy!
JHutch
Re:formatting (Score:3, Informative)
As much as I think the project is digging thems
Helping improve OCR software? (Score:3, Insightful)
Think about it. You have thousands of volunteers poring over images, and then providing the corrected text (if necessary). Couldn't this also be used to "train" the OCR software to become better at identifying text?
If you log the image, the original OCR'd text, and the manually verified text you could use it in a test case for future OCR software.
I do this all the time when I write data validation/cleanup software.. I run my input data through a program, capture the output, and manually verify that it is correct.. making changes if necessary. I then use the two pieces of information in my test cases as a benchmark. If I introduce a bug in my code that causes something I already wrote to suddenly break, or output incorrect results, I know about it instantly. Works great with database correction code.
Maybe I'm simplifying this too much, but I sure hope someone is capturing all this great data. It could come in handy..
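Roughly what I have in mind, as a sketch rather than anything DP actually runs: keep each page image with its verified text, score a new OCR build against the verified text, and flag pages that got worse. The names `char_accuracy` and `find_regressions` are invented for this example, the "images" are stand-in strings, and the similarity score from Python's `difflib` is only a crude proxy for a proper character error rate.

```python
from difflib import SequenceMatcher


def char_accuracy(ocr_text: str, verified_text: str) -> float:
    """Similarity in [0, 1] between raw OCR output and the proofread text."""
    return SequenceMatcher(None, ocr_text, verified_text, autojunk=False).ratio()


def find_regressions(cases, ocr_engine, baseline):
    """Return ids of pages where the engine scores below its recorded baseline.

    cases:      dict of page_id -> (image, verified_text)
    ocr_engine: callable taking an image and returning text
    baseline:   dict of page_id -> accuracy recorded for the previous build
    """
    regressions = []
    for page_id, (image, verified) in cases.items():
        score = char_accuracy(ocr_engine(image), verified)
        if score < baseline.get(page_id, 0.0):
            regressions.append(page_id)
    return regressions


if __name__ == "__main__":
    # Stand-in data: a fake engine that misreads "S" as "5" on one page.
    fake_output = {"img1": "The 5ea", "img2": "Hello"}
    cases = {"p1": ("img1", "The Sea"), "p2": ("img2", "Hello")}
    baseline = {"p1": 1.0, "p2": 1.0}
    print(find_regressions(cases, fake_output.__getitem__, baseline))
```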
from the error-checking-and-correcting dept. (Score:5, Funny)
What books to read (Score:2)
Re:What books to read (Score:2, Informative)
Of the authors I got to know through Project Gutenberg, Stephen Leacock and Theodor Storm stick out in my mind the most. Oh, and Hendrik Conscience turned out to be less boring than I thought after proofing the first of his books to go through DP (but so far he's only available in Dutch).
Request for MATH experts (Score:5, Interesting)
JHutch
Re:Request for MATH experts (Score:2)
Re:Request for MATH experts (Score:2, Informative)
Only one MATH book is ever in the first round at any one time. Hilbert's book is that one right now.
The logic behind this is simple. Most of our volunteers avoid these books like the plague and if we kept releasing new ones, pretty soon the entire first round would be only MATH books.
To see what's waiting in the queue for English language math books, see here [pgdp.net]. For Languages Other Than English (LOTE) math books, see here [pgdp.net].
Public apology (Score:3, Informative)
The 5000th Posted celebrations were supposed to be internal. There is a discrepancy between works posted and books posted: sometimes a book gets split up. The big celebrations were intended for 5000 actual books posted.
I am afraid I got a little carried away, and hope Slashdot will still carry the real story of 5000 books posted to Project Gutenberg.
I miss the smell of damp paper - (Score:2)
What ever happened to Project Gutenberg 2? (Score:2)
Re:What ever happened to Project Gutenberg 2? (Score:2, Informative)
Have you read an ebook? (Score:1)
1. Download a text: (say Alice's Adventures in Wonderland [gutenberg.net]). The new site has a vastly improved interface; listing books in available formats (always plain text, sometimes pdf, palm doc, tex)
2. Have at it in your text reader of choice. If you are on the Mac, I highly recommend the free tofu [mac.com]. It breaks the text into columns that are as high as the window. Navigate by shifting columns or pages of text. This simple change makes a huge difference when reading large amounts of
Once again (Score:2)
Again, I think the likelihood that two independent OCR processes (separate text, separate scanners, separate OCR packages) would both make the same mistake is small, so it's mostl
Re:Once again (Score:2, Informative)
There are several reasons. Firstly, there are lots of people around who can spare five minutes to proofread a page -- particularly when it has already been OCRed. Secondly, we are a completely volunteer organisation, with no 'plan' as to the books we scan, and so having to find and scan two separate copie
Old Scientific Works? (Score:2)
How about old scientific works, journals up to say 1920's?
I know the more recent journal articles are copyrighted and therefore must have some lengthy protection on them, but what about classic old articles (like some of Einstein's work in the early 1900's)?
because (Score:1)
wait a few (read: lots of) years and you'll be seeing 'em tossed up there, editors dutifully rendering pictures into ASCII, etc.
Re:because (Score:4, Interesting)
Very true, although several of us do keep talking about searching for some Victorian Porn to put through the site
Re:because (Score:2, Interesting)
Yeah! I'm one of the "several" that Jon's referring to. I got a real kick out of a recent book that was posted by us to PG...
Sane Sex Life and Sane Sex Living [gutenberg.net]
For a turn of the century study of sex (published 1919), this guy was amazingly (IMHO) progressive! A very fun read!
JHutch
Re:What about 5001? (Score:3, Funny)
I'm assuming your signature link is related to this, so yes, you could say you did call the press.
Re:What about 5001? (Score:5, Funny)
Re:What about 5001? (Score:1, Funny)
That sir, is not constipation. That is uncontrollable demon bowels.