Counting the World's Books
The Google Books blog has an explanation of how they attempt to answer a difficult but commonly asked question: how many different books are there? Various cataloging systems are fraught with duplicates and input errors, and only encompass a fraction of the total distinct titles. They also vary widely by region, and they haven't been around nearly as long as humanity has been writing books. "When evaluating record similarity, not all attributes are created equal. For example, when two records contain the same ISBN this is a very strong (but not absolute) signal that they describe the same book, but if they contain different ISBNs, then they definitely describe different books. We trust OCLC and LCCN number similarity slightly less, both because of the inconsistencies noted above and because these numbers do not have checksums, so catalogers have a tendency to mistype them." After refining the data as much as they could, they estimated there are 129,864,880 different books in the world.
I propose a new filesystem (Score:1)
Boobs (Score:1)
How do you define "different book"? (Score:4, Interesting)
Look at textbooks - new editions that are almost indistinguishable from the previous editions have new ISBNs. Do we count every single one as a different book?
Re: (Score:2)
Re: (Score:2)
Also hardback vs. paperback, publishing in different regions as a distinct book, etc. Maybe ISBNs could be extended to encode all these different fields in additional digits, so that there is a component that is unique to a specific book (regardless of edition, publisher, etc), extra information that uniquely* identifies which specific edition/version/variant of the book it is, and then yet more information that uniquely identifies which publisher circulated that book.
*A SHA-2 or SHA-3 hash of the bo
Re: (Score:2)
And every goddamed one of them is scanned by google, foisted by Barnes and Noble and Amazon and everybody else as a separate book.
I once counted twenty different versions of the same popular (copyright lapsed) classic, all scanned by Google, many from the exact same edition found in various libraries. Some horrible, some quite readable.
I'm not sure anything is served by having both the 1902 and the 1903 versions of any popular fiction available in ebook form. Any serious researcher would search out the ph
So ~200TB = "All The Books" (Score:2)
A typical book is in the range of 1-2MB of text, assuming you're representing actual letters, as opposed to scanned images of the text, and ignoring illustrations, pictures, etc. So if there are about 130 million books, that's about 200TB to store them uncompressed, maybe 50TB compressed. If you've got multiple versions that are almost identical (e.g. Third Printing from Paperback Publisher B has a different copyright page than First printing from Hardback Publisher A, and maybe a different cover page ill
Re: (Score:2)
I just compressed a 2.0GB plain text log file to 42M, so I imagine compressing a book would have roughly 40:1 performance. So 200TB would compress to more like 5TB. If we used a sort of cell-shading on colorful images with flat color regions, we could turn them back to simple images that blot down to PNG very well; same with black and white illustrations, although restoring colored pencil on paper would pose difficulty (really, we want to identify the paper and remove it in favor of a paper-like backgroun
Efficiency of lossless compression (Score:2)
Log files are typically very structured low-entropy data. With random natural-language text you seldom get better than 3 or 4 to 1 lossless compression. Image compression can do better, but that's typically already been done to get the JPG/PNG/GIF/etc., and it's typically lossy, and of course video compression is much better because most of an image doesn't change much from frame to frame. But in this case they're trying to OCR the data, so much of that image compressibility has already been replaced (be
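For what it's worth, the back-of-envelope numbers in this sub-thread are easy to check. A minimal sketch, assuming 1.5 MB of text per book (the midpoint of the 1-2MB range quoted above) and decimal terabytes; the names and the chosen ratios are only for illustration:

```python
BOOKS = 130_000_000      # rounded book count from the summary
MB_PER_BOOK = 1.5        # assumption: midpoint of the 1-2 MB range
MB_PER_TB = 1_000_000    # decimal units (1 TB = 10^6 MB)


def total_tb(books: int, mb_per_book: float, ratio: float = 1.0) -> float:
    """Storage in TB for `books` texts, compressed by `ratio`:1."""
    return books * mb_per_book / MB_PER_TB / ratio


uncompressed_tb = total_tb(BOOKS, MB_PER_BOOK)        # no compression
natural_text_tb = total_tb(BOOKS, MB_PER_BOOK, 4.0)   # 4:1, typical for prose
log_like_tb = total_tb(BOOKS, MB_PER_BOOK, 40.0)      # 40:1, as seen on a log file

print(uncompressed_tb, natural_text_tb, log_like_tb)
```

At 4:1 that comes to a little under 50TB; at the 40:1 seen on the log file it would be about 5TB, so the whole disagreement is really about which ratio applies to book text.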
Re: (Score:2)
Structure has nothing to do with compression: only content affects redundancy. In this case, a lot of it is IP and IDS logs, mixed with system logs, etc... it's a log of everything that touches syslog, basically. And bzip2 uses 900,000 byte blocks, so anything more than a meg away is irrelevant.
English text compresses rather well in any case, as it is by nature well-structured and redundant. I used to analyze uncompressed English text through shitty encryption by shit as simple as two-tuple frequency c
Re:How do you define "different version"? (Score:2)
I can tell this topic is going to be dominated by people who never had to deal with the internals of a revision-control system, much less a configuration-management system, because the issues are somewhat trivial once you get past your fear of the variables.
Re: (Score:1)
Also by people who have never read the article, where it explains in some significant detail how they try to determine what constitutes "a book" for the purposes of their counting.
Re: (Score:3, Funny)
You read the article?
Impostor! Burn the witch!
Re:How do you define "different version"? (Score:4, Informative)
Look at textbooks - new editions that are almost indistinguishable from the previous editions have new ISBNs. Do we count every single one as a different book?
From TFS: if they contain different ISBNs, then they definitely describe different books
If they're using this method, GP's point is valid. The books are not really new books, they're essentially the same as previous editions but have different ISBNs. In essence, these new editions with new ISBNs are being counted twice (or more) for very small revisions to the same book.
Re: (Score:2, Informative)
From TFA: Well, it all depends on what exactly you mean by a “book.” We’re not going to count what library scientists call “works,” those elusive "distinct intellectual or artistic creations.” It makes sense to consider all editions of “Hamlet” separately, as we would like to distinguish between -- and scan -- books containing, for example, different forewords and commentaries. (emphasis mine)
For Google's definition of what constitutes a unique work as used
Re: (Score:2)
I was not proposing a new method of counting books... I was only supporting the OP in his assertion that their method contains limitations regarding repetition of works with minor differences.
I was mainly responding to those who just said RTFA without seeing basic facts in TFS.
Re: (Score:2)
Again, this is why I'd like to see additional information encoded in an extension to the book's ISBN number, such as a hash of the contents. Regardless of what the extension is, the split should permit you to identify "works that descend directly from a single work" plus "works that differ in content" (regardless of what they descend from). Then there would be no problem. You would be able to extract the level of information you wanted and no information would risk getting lost because such-and-such a group
Re: (Score:2)
Re: (Score:2)
Also, if a publisher purchases a title from another publisher it gets a new ISBN with the new publisher even though it is the same book.
Re: (Score:2, Informative)
Re: (Score:1)
It's a one-page article, and contains a really good explanation of what they mean by a book for the purposes of their counting, and why.
The following sentence from the article cuts straight to the heart of their concept of uniqueness:
It makes sense to consider all editions of “Hamlet” separately, as we would like to distinguish between -- and scan -- books containing, for example, different forewords and commentaries.
So, yes, if they scan textbooks they'll scan all versions they can
Re: (Score:2)
Re: (Score:2)
That's not true. Getting an ISBN isn't hard and self publishing companies will generally assign you one as part of the deal.
Re: (Score:2)
That's not true. Getting an ISBN isn't hard and self publishing companies will generally assign you one as part of the deal.
Amazon's Kindle, for example, will assign you an ISBN. However, if you bought your own ISBNs you can use them too. You are supposed to assign a different one to the eBook, paperback, audio and hardback. However, if you use the same one for all there are not many checks to stop you if you are using multiple services.
Re: (Score:2)
Depends on the size of the publishing house and the expected sales volume. If you're selling through a major bookstore chain, yeah, you're going to have an ISBN. For an independent author selling a few hundred copies of a book on the history of Three Way [google.com] in a local bookstore, you probably won't have an ISBN---particularly if the book printing and binding was done at the Kinko's in Ja
Re: (Score:2)
8 or 9-place estimate (Score:2, Insightful)
estimate would be about 130 million, not 129,864,880
Re: (Score:3, Insightful)
Re: (Score:2)
Significant digits are for science teachers and marriage counselors!
Ok, what am I missing here?
Re: (Score:3, Funny)
Ring finger, presumably.
Re: (Score:2)
But 130 million can't possibly be right! We better assign some false precision to make our estimate believable. Significant digits are for science teachers and marriage counselors!
Why stop at 8 or 9? 18 is much better and just as meaningful: 129,864,880.461938427
Whew....almost done! (Score:2, Funny)
I'm almost done reading them all!
Re: (Score:2)
Re: (Score:2)
I'm almost done reading them all!
That's my next challenge - once I've finished reading the web.
Re: (Score:2)
I can just ruin the ending for you....
http://www.wwwdotcom.com/ [wwwdotcom.com]
Re: (Score:2)
If they don't have the will to obtain an International Standard Book Number for their Internationally published book, then why bother counting it at all? After all, I wrote a book in first grade, consisting of 16 pages of poorly drawn pictures and brutal (if accurate) grammar... Should this be counted too?
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
How many Libraries of Congress is that?
Re: (Score:1)
0.13 Gigabooks.
That's an ESTIMATE? (Score:4, Interesting)
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
If you RTFA (blasphemy, I'm sure), Google doesn't say that 129,864,880 is an estimate - they say that is the number of books, total (at least until Sunday).
The only estimate mentioned is "16 million bound serial and government document volumes".
Surprise surprise, subby is the culprit that turned such an exact number into an "estimate".
Re: (Score:2)
That's the point. There is no way in hell that their accuracy is that great.
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
You lose accuracy by representing error bounds simply by the significant digits of the number. It is convention-dependent that the last sig fig is assumed to be +/- 1 (zero being assumed non-significant unless followed by a decimal point, unless the zero is already after a decimal point). That's what I remember from high school chem. And it's a convention that makes sense for, say, reading a temperature off of a thermometer. You don't know if the actual value was rounded up or down to give the instrumen
Wow (Score:3, Insightful)
They should write a book!
Re: (Score:2)
I considered that, but there's a problem... (Score:2)
So if I wrote a book about this, should I call it "The 129,864,880 Books That You Must Read Before You Die", or "The 129,864,881 Books That You Must Read Before You Die"?
Seriously... (Score:2)
Re:Seriously... (Score:4, Insightful)
Who cares? Does it matter?
Does anything?
Re: (Score:2)
Mod parent up...+1 emo.
Re: (Score:2)
No... don't be an asshole, GP won't like that. Mod GP down -1 Emo.
Re: (Score:2)
Ooh. I've got it. We'll call it the Library of Congress crypto scheme. We could use it for encrypting other stuff, too. Any arbitrary word could be encoded as an LOC identifier, a page number, and an offset in bytes or words. Man, wouldn't that suck to decrypt?
1 in 50 people wrote a book (Score:2)
If you divide the number of books by the current world population, you get one unique book for every 50 people, or on average one in 50 people wrote a book, a figure that includes the poor, the illiterate and children.
Of course, some book writers have died and many have written more than one book, but I suspect that most books have been written recently and their writers are still alive.
If you only include adults who live a comfortable western lifestyle, it may be as high as one in 10.
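The division is quick to reproduce. A rough sketch, assuming a world population of about 6.9 billion (my assumption for 2010, not a figure from the article):

```python
BOOKS = 129_864_880                # count quoted in the summary
WORLD_POPULATION = 6_900_000_000   # assumption: rough 2010 world population

# How many living people there are per counted book.
people_per_book = WORLD_POPULATION / BOOKS

# Population size for which the same book count would mean "one in 10".
population_for_one_in_ten = BOOKS * 10

print(round(people_per_book, 1), population_for_one_in_ten)
```

So "one in 50" is about right for books per living person, and "one in 10" would require attributing every counted book to a group of roughly 1.3 billion people.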
Re: (Score:3, Insightful)
I suspect that most books have been written recently and their writers are still alive.
And I suspect that you are full of crap.
Re: (Score:1)
Re: (Score:2)
90% of all scientists who ever lived are alive today and many of the books have been written by scientists.
The percentage may not be as high for all authors, but I think it would be close.
Re: (Score:2)
Given the enormous explosion in literacy and printing press technology over the last 100 years, I would say he's probably closer than you think. Also, it's estimated that human knowledge doubles every 7 years -- that would mean a doubling of the number of things written down or published.
What would resolve this is to discover how many books existed 100 years ago, and 50 years ago.
Re: (Score:2)
A surprisingly large proportion of the humans who have ever lived are actually alive now (most people estimate it at about 10%). It is _way_ easier now to publish a book than it was even 100 years ago.
I'm not saying you're wrong about GP's assumptions made, but personally I'd guess he's right. That's just a guess though ;).
Re: (Score:2)
The Straight Dope, in 1987, said:
http://www.straightdope.com/columns/read/413/how-many-people-have-lived-on-earth-since-the-dawn-of-time [straightdope.com]
Re: (Score:1)
Indeed, just yesterday I met Shakespeare. He was talking with Lewis Carroll and Douglas Adams. Unfortunately I couldn't talk to them, because Plato was just coming around the corner, arguing with Aristotle and Kant about some philosophical problem, and I would have been in their way. On the other side of the room, Mao was arguing with the evangelists about who has written the better Bible. Karl Marx didn't help Mao, beca
Re: (Score:2)
Faulty generalization [wikipedia.org]
Re: (Score:2)
So you're dead, and talking to us from Riverworld, right?
Re: (Score:2)
Isaac Asimov wrote over 500 books. I don't know how many Terry Pratchett has written but the number is in the dozens. There's Clarke, Heinlein, Niven... and those are just a few science fiction writers (yes, Asimov also wrote nonfiction and Pratchett is known mainly for fantasy). Serious authors write more than one book each.
So your average is a little meaningless.
Re:1 in 50 people wrote a book (Score:4, Informative)
Isaac Asimov wrote over 500 books. I don't know how many Terry Pratchett has written but the number is in the dozens. There's Clarke, Heinlein, Niven... and those are just a few science fiction writers (yes, Asimov also wrote nonfiction and Pratchett is known mainly for fantasy). Serious authors write more than one book each.
So your average is a little meaningless.
No, averages are very meaningful. Extremely meaningful. They are the AVERAGE (usually the mean), which means that some values will be above, and some values will be below. The idiocy comes in when people mistakenly jump to the conclusion that just because an average exists, it means that every value must be exactly the same as the average. Or that, just because you can find extreme values far away from the average, the average is not meaningful.
If the average states that 1 in 50 people have written a book, then, by gum, it will be easy to find plenty of people who have written zero books, somewhat fewer who have written exactly one (something below 1 in 50), much fewer who have written exactly two, even fewer who have written exactly three, etc. That does not mean that example authors with hundreds of books cannot exist, it only bounds how frequent they can be.
Of the myriad ideas that the academic community has utterly failed to teach the general public, one is the relationship between averages and distributions. One more time: just because an average exists, it does not mean that every datum has the same value as the average. As an example, just because the average male in the US is 5' 9", it does not mean that every single male is that tall, nor that you will not find ones that are shorter, taller, or even much shorter or much taller. The tallest man (according to my 20 seconds of research through Google) was 8' 11", and the shortest was 1' 10" ... does that lessen the meaningfulness or utility of the average male height? Rather the contrary: it provides important information as to the extent of the distribution of heights.
Now, I suspect that the parent poster is trying to say that because -- by loosely founded speculation -- most authors are professional authors ("serious authors") and therefore will have more than one book to their name, the classification of people into authors and non-authors will be skewed against 1:50. I would not argue against that (in fact, I indirectly argued for it above). Nevertheless, using the utterly non-scientific sample of the books above my desk, most authors have only one book to their name, so the number isn't going to be much worse than 1:50, perhaps 1:55 or 1:60. That kind of pure, unadulterated speculation is exactly the sort I would love to see proved wrong with hard data.
Re: (Score:2)
One more time: just because an average exists, it does not mean that every datum has the same value as the average.
That was the point I was trying to make. If there is one book written for each group of fifty people, the average would be one in fifty but the actual number of authors would be less than one in fifty people, probably far less. But as you say, there's no way of knowing how much less without actual data.
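The gap between "books per person" and "authors per person" can be shown with a toy distribution. A minimal sketch; every number in it is made up purely for illustration and is not real authorship data:

```python
from collections import Counter

# Hypothetical population of 1000 people.
# Key = books written by a person, value = number of such people.
books_per_person = Counter({0: 988, 1: 8, 2: 2, 4: 2})

people = sum(books_per_person.values())
books = sum(count * n for count, n in books_per_person.items())
authors = sum(n for count, n in books_per_person.items() if count > 0)

people_per_book = people / books      # the "1 in 50" style average
people_per_author = people / authors  # how rare authors actually are

print(people, books, authors, people_per_book, round(people_per_author, 1))
```

With these invented figures there is one book per 50 people but only one author per 83 or so, because a few prolific authors account for a large share of the books. How big the gap is in reality depends entirely on the real distribution, which nobody in this thread has.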
Old News (Score:2)
Qoh.12 [12] ... Of making many books there is no end,
They could just use (Score:2)
E.g. UPC code 54556 39824. Sum is 51. Next tens is 60. 60-51=9 so the check digit is 9. The same basic formula could work for ISBN numbers too.
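The scheme as described in this comment (plain digit sum, rounded up to the next multiple of ten) can be sketched as follows. Note that this is only the simplified version from the comment: the actual UPC-A check digit weights alternate positions by 3. The function name is made up for illustration.

```python
def simple_check_digit(code: str) -> int:
    """Check digit = distance from the digit sum to the next multiple of 10."""
    digits = [int(ch) for ch in code if ch.isdigit()]  # ignore spaces/hyphens
    total = sum(digits)
    return (10 - total % 10) % 10


check = simple_check_digit("54556 39824")   # digit sum is 51

# An unweighted sum cannot notice two digits trading places:
swapped = simple_check_digit("45556 39824")  # first two digits swapped

print(check, swapped)
```

Both calls return the same check digit, because swapping digits does not change their sum. That is the weakness a weighted scheme is meant to address.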
Re: (Score:2)
ISBN check codes are designed to catch common errors back when hand entry was common -
a run of two digits in the wrong place (eg 556 instead of 566)
a mistyped digit
two digits swapped around by one place
The UPC code does not support the latter at the expense of only requiring the check symbol to be one of 10 regardless of the number of digits in the code. The ISBN algorithm requires n+1 where n is the number of data digits. Whether this is required nowadays given that very few ISBNs are entered by hand is
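For comparison, here is a minimal sketch of the ISBN-10 check character: weights 10 down to 2 over the nine data digits, modulo 11, with X standing for 10. The function name is hypothetical, and this covers only the 10-digit form, not ISBN-13.

```python
def isbn10_check_char(first9: str) -> str:
    """Return the ISBN-10 check character for the nine data digits."""
    digits = [int(ch) for ch in first9 if ch.isdigit()]  # ignore hyphens
    if len(digits) != 9:
        raise ValueError("expected exactly nine data digits")
    # Weighted sum: first digit x10, second x9, ... ninth x2.
    total = sum(w * d for w, d in zip(range(10, 1, -1), digits))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


original = isbn10_check_char("0-306-40615")     # data digits of 0-306-40615-2
transposed = isbn10_check_char("3-006-40615")   # first two digits swapped

print(original, transposed)
```

Because each position has a different weight and 11 is prime, swapping two unequal adjacent digits always changes the check character, which is why the 11th symbol (X) is needed.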
can't grok the numbers... (Score:1, Funny)
129,864,880 different books? What is that in Libraries of Congress?
Re: (Score:1)
129,864,880 published books, that is. (Score:4, Insightful)
This estimate isn't bad for published works, but it does not adequately answer the question posed, ``Just how many books are out there?''
ISBN sucks for digital books (Score:4, Insightful)
ISBNs suck as identifiers for digital books, especially digital books that are free. There are two problems.
Problem number one is that they cost money. Let's say someone writes up a really nice manual documenting some open-source software. He wants the manual to be free, just like the software. But now if he wants an ISBN, he has to pay money to get the ISBN, which means expending dollars on a book that is not going to be bringing in any dollars. The fact that ISBNs cost money is out of step with the fact that we have this thing called the World Wide Web, which is basically a huge machine for letting people do publishing without the per-copy costs that are associated with print publishing.
The other problem is that ISBNs are supposed to uniquely identify an edition of the book. This makes sense for traditional print publishing, where the economics of production forced people to make discrete editions widely spaced in time. It makes no sense for print on demand or for pure digital publishing. I've written some CC-licensed textbooks. When someone emails me to let me know about a typo or a factual error, I fix it right away in the digital version, and I usually update the print-on-demand version within about 6 months. No way am I going to assign a different ISBN every 6 months.
We can say that ISBNs are for printed books, not for ephemeral web pages, but that doesn't really work. The two overlap. My textbooks exist simultaneously as web pages, pdf files, and printed books. Amazon sells a book for the kindle using one ISBN, assigning a different ISBN to the printed version. Print-on-demand books share some characteristics with printed books (e.g., they're physical objects) and some with the web (can be updated continuously).
By the way, why do you think library catalogs don't show ISBNs? It's because ISBNs are meant as commercial tools, like the barcode on a box of cereal. If google finds ISBNs useful for other purposes than selling copies of books, it's probably because google is trying to deal with a massive number of books using a minimum amount of human labor.
what about pre-20th century works? (Score:2)
1) TFA actually acknowledges that the ISBN is very North America-centric, but the other cataloging types are also either N.A.-centric or at least western world-centric.
2) The entire article is based on efforts to simply compile a list of books by aggregating and loosely filtering/sorting several other lists. The lists mentioned are, as far as I know, all heavily biased toward 19th and 20th century works. (The article explicitly menti
129,864,880 different books in the world (Score:1)