Books Google Math

Counting the World's Books 109

The Google Books blog has an explanation of how they attempt to answer a difficult but commonly asked question: how many different books are there? Various cataloging systems are fraught with duplicates and input errors, and only encompass a fraction of the total distinct titles. They also vary widely by region, and they haven't been around nearly as long as humanity has been writing books. "When evaluating record similarity, not all attributes are created equal. For example, when two records contain the same ISBN this is a very strong (but not absolute) signal that they describe the same book, but if they contain different ISBNs, then they definitely describe different books. We trust OCLC and LCCN number similarity slightly less, both because of the inconsistencies noted above and because these numbers do not have checksums, so catalogers have a tendency to mistype them." After refining the data as much as they could, they estimated there are 129,864,880 different books in the world.
  • by jonnythan ( 79727 ) on Friday August 06, 2010 @12:33PM (#33164698)

    Look at textbooks - new editions that are almost indistinguishable from the previous editions have new ISBNs. Do we count every single one as a different book?

    • Same thing with any other book: second editions and republications (the Del Rey version versus the Pyr version, etc.) with the exact same text unedited; multiple publishers of public domain works; and so on.
      • by jd ( 1658 )

        Also hardback vs. paperback, publishing in different regions as a distinct book, etc. Maybe ISBNs could be extended to encode all these different fields in additional digits: a component that is unique to a specific book (regardless of edition, publisher, etc.), extra information that uniquely* identifies which specific edition/version/variant of the book it is, and yet more information that uniquely identifies which publisher circulated that book.

        *A SHA-2 or SHA-3 hash of the book, for instance.
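
        A minimal sketch of that hashing idea in Python. Everything here is illustrative, not part of any real ISBN extension: the normalization rules and the choice of SHA-256 (a SHA-2 variant) are assumptions.

          import hashlib
          import re

          def text_fingerprint(book_text: str) -> str:
              """Identify a book's content independent of typesetting.

              Collapses whitespace and case so that trivial layout
              differences (line breaks, spacing) don't change the hash.
              """
              normalized = re.sub(r"\s+", " ", book_text).strip().lower()
              return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

          # Two printings with identical text map to the same fingerprint.
          a = "Call me Ishmael. Some years ago..."
          b = "Call me Ishmael.\nSome years ago..."
          assert text_fingerprint(a) == text_fingerprint(b)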

      • by icebike ( 68054 )

        And every goddamned one of them is scanned by Google, foisted by Barnes and Noble and Amazon and everybody else as a separate book.

        I once counted twenty different versions of the same popular (copyright lapsed) classic, all scanned by Google, many from the exact same edition found in various libraries. Some horrible, some quite readable.

        I'm not sure anything is served by having both the 1902 and the 1903 versions of any popular fiction available in ebook form. Any serious researcher would search out the physical copy.

      • A typical book is in the range of 1-2MB of text, assuming you're representing actual letters, as opposed to scanned images of the text, and ignoring illustrations, pictures, etc. So if there are about 130 million books, that's about 200TB to store them uncompressed, maybe 50TB compressed. If you've got multiple versions that are almost identical (e.g. Third Printing from Paperback Publisher B has a different copyright page than First Printing from Hardback Publisher A, and maybe a different cover page illustration), the near-duplicates add very little on top of that.
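
        The arithmetic, spelled out. The average size and the compression ratio below are assumptions (midpoint of the 1-2MB range, ~4:1 for natural-language text):

          # Back-of-the-envelope storage math from the figures above.
          books = 130_000_000    # Google's (rounded) count
          avg_size_mb = 1.5      # assumed midpoint of the 1-2 MB range
          compression = 4        # assumed ~4:1 for natural-language text

          raw_tb = books * avg_size_mb / 1_000_000               # MB -> TB
          print(f"uncompressed: ~{raw_tb:.0f} TB")               # ~195 TB
          print(f"compressed:   ~{raw_tb / compression:.0f} TB") # ~49 TB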

        • I just compressed a 2.0GB plain text log file to 42M, so I imagine compressing a book would have roughly 40:1 performance. So 200TB would compress to more like 5TB. If we used a sort of cel-shading on colorful images with flat color regions, we could turn them back into simple images that compress down to PNG very well; same with black and white illustrations, although restoring colored pencil on paper would pose difficulty (really, we want to identify the paper and remove it in favor of a paper-like background).
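
          The ratio is easy enough to measure yourself; a small sketch (the file path in the usage note is just an example):

            import bz2
            import os

            def bz2_ratio(path: str) -> float:
                """Compress a file with bzip2; return original/compressed size."""
                original = os.path.getsize(path)
                with open(path, "rb") as f:
                    compressed = len(bz2.compress(f.read()))
                return original / compressed

            # e.g. bz2_ratio("/var/log/syslog") -> ~40:1 for repetitive logs,
            # but typically closer to 3-4:1 for natural-language book text.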

          • Log files are typically very structured low-entropy data. With random natural-language text you seldom get better than 3 or 4 to 1 lossless compression. Image compression can do better, but that's typically already been done to get the JPG/PNG/GIF/etc., and it's typically lossy, and of course video compression is much better because most of an image doesn't change much from frame to frame. But in this case they're trying to OCR the data, so much of that image compressibility has already been replaced.

            • Structure has nothing to do with compression: only content affects redundancy. In this case, a lot of it is IP and IDS logs, mixed with system logs, etc... it's a log of everything that touches syslog, basically. And bzip2 uses 900,000 byte blocks, so anything more than a meg away is irrelevant.

              English text compresses rather well in any case, as it is by nature well-structured and redundant. I used to analyze uncompressed English text through shitty encryption by shit as simple as two-tuple frequency counting.
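
              A toy sketch of that kind of frequency counting (real cryptanalysis would compare these counts against known English bigram statistics):

                from collections import Counter

                def bigram_frequencies(text: str) -> Counter:
                    """Count adjacent letter pairs (two-tuples) in a text.

                    English bigram statistics ('th', 'he', 'in', ...) survive
                    simple substitution ciphers, which is why frequency
                    counting breaks weak encryption of natural-language text.
                    """
                    letters = [c for c in text.lower() if c.isalpha()]
                    return Counter(zip(letters, letters[1:]))

                sample = "The quick brown fox jumps over the lazy dog."
                print(bigram_frequencies(sample).most_common(3))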

    • I can tell this topic is going to be dominated by people who never had to deal with the internals of a revision-control system, much less a configuration-management system, because the issues are somewhat trivial once you get past your fear of the variables.

      • Also by people who have never read the article, where it explains in some significant detail how they try to determine what constitutes "a book" for the purposes of their counting.

        • Re: (Score:3, Funny)

          by dgatwood ( 11270 )

          You read the article?

          Impostor! Burn the witch!

        • by Smauler ( 915644 ) on Friday August 06, 2010 @02:44PM (#33167046)

          Look at textbooks - new editions that are almost indistinguishable from the previous editions have new ISBNs. Do we count every single one as a different book?

          From TFS: if they contain different ISBNs, then they definitely describe different books

          If they're using this method, GP's point is valid. The books are not really new books, they're essentially the same as previous editions but have different ISBNs. In essence, these new editions with new ISBNs are being counted twice (or more) for very small revisions to the same book.

          • Re: (Score:2, Informative)

            by natehoy ( 1608657 )

            From TFA: Well, it all depends on what exactly you mean by a "book." We're not going to count what library scientists call "works," those elusive "distinct intellectual or artistic creations." It makes sense to consider all editions of "Hamlet" separately, as we would like to distinguish between -- and scan -- books containing, for example, different forewords and commentaries. (emphasis mine)

            For Google's definition of what constitutes a unique work as used in their count, read the article.

            • by Smauler ( 915644 )

              I was not proposing a new method of counting books... I was only supporting the OP in his assertion that their method contains limitations regarding repetition of works with minor differences.

              I was mainly responding to those who just said RTFA without seeing basic facts in TFS.

    • by Suki I ( 1546431 )
      With the advent of self-publishing and individuals purchasing their own ISBN blocks, the possibility of different works getting the same ISBN increases greatly, especially when they are not using a distribution service like Amazon that *might* check whether that ISBN is already in use.
    • by Jeng ( 926980 )

      Also, if a publisher purchases a title from another publisher it gets a new ISBN with the new publisher even though it is the same book.

    • Re: (Score:2, Informative)

      by gpf2 ( 1609755 )
      What about translations? What about bootlegged copies from the 18th century? What about languages that have no direct concept of "edition"? The International Federation of Library Associations and Institutions (IFLA) has been wrestling with this for a while. Their solution, of sorts: Functional Requirements for Bibliographic Records (FRBR). http://www.ifla.org/en/publications/functional-requirements-for-bibliographic-records [ifla.org] Pretty dense and not consistently adopted.
    • It's a one-page article, and contains a really good explanation of what they mean by a book for the purposes of their counting, and why.

      The following sentence from the article cuts straight to the heart of their concept of uniqueness:

      It makes sense to consider all editions of “Hamlet” separately, as we would like to distinguish between -- and scan -- books containing, for example, different forewords and commentaries.

      So, yes, if they scan textbooks they'll scan all versions they can find.

    • In the 1480s an edition of Dante's Divine Comedy was printed in Venice. In 1481 another was printed in Florence. Each has the exact same text barring printer mistakes, and if you are lucky enough to have the Florence one, it includes the plates: illustrations. Each is also an absolute work of art in its own right and distinct from the other. Should these be recorded as one book or two?
  • by Anonymous Coward

    An estimate would be about 130 million, not 129,864,880

    • Re: (Score:3, Insightful)

      by SomeJoel ( 1061138 )
      But 130 million can't possibly be right! We better assign some false precision to make our estimate believable. Significant digits are for science teachers and marriage counselors!
      • Significant digits are for science teachers and marriage counselors!

        Ok, what am I missing here?

      • But 130 million can't possibly be right! We better assign some false precision to make our estimate believable. Significant digits are for science teachers and marriage counselors!

        Why stop at 8 or 9? 18 is much better and just as meaningful: 129,864,880.461938427

  • I'm almost done reading them all!

  • That's an ESTIMATE? (Score:4, Interesting)

    by wealthychef ( 584778 ) on Friday August 06, 2010 @12:58PM (#33165192)
    I'm very suspicious about their numerical precision. If it's an estimate, then they are saying it's 129,864,880 +/- 10. That is, they are pretty sure there aren't 129,864,980 books. I think they should make their estimate something like "we think there are about 130,000,000" or whatever accuracy they actually believe.
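
    Rounding to the precision they can actually defend is a one-liner; a sketch (the helper name is made up):

      from math import floor, log10

      def round_sig(x: float, sig: int = 2) -> float:
          """Round x to the given number of significant figures."""
          if x == 0:
              return 0.0
          return round(x, -int(floor(log10(abs(x)))) + (sig - 1))

      print(round_sig(129_864_880, 2))  # 130000000 -- "about 130 million"
      print(round_sig(129_864_880, 4))  # 129900000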
    • For sure. Even gravity can't be specified to that many significant digits, and it's a bit more knowable than the number of books in the world.
    • Also, what is the date and time of this estimate? How many books are published a day around the world?
    • If you RTFA (blasphemy, I'm sure), Google doesn't say that 129,864,880 is an estimate - they say that is the number of books, total (at least until Sunday).

      The only estimate mentioned is "16 million bound serial and government document volumes".

      Surprise surprise, subby is the culprit that turned such an exact number into an "estimate".

    • by city ( 1189205 )
      I'm suspicious about the accuracy of numbers in general, I use 'some' for a few things and 'many' for more. I estimate there are many books in the world.
    • You lose accuracy by representing error bounds simply by the significant digits of the number. It is convention-dependent that the last sig fig is assumed to be +/- 1 (a zero being assumed non-significant unless followed by a decimal point, unless the zero is already after a decimal point). That's what I remember from high school chem. And it's a convention that makes sense for, say, reading a temperature off of a thermometer. You don't know if the actual value was rounded up or down to give the instrument's reading.

  • Wow (Score:3, Insightful)

    by demonbug ( 309515 ) on Friday August 06, 2010 @01:11PM (#33165438) Journal

    They should write a book!

  • Who cares? Does it matter?
  • If you divide the number of books by the current world population, you get about one unique book for every 50 people, or on average one in 50 people has written a book, including many who are poor, illiterate, or children.

    Of course, some book writers have died and many have written more than one book, but I suspect that most books have been written recently and their writers are still alive.

    If you only include adults who live a comfortable western lifestyle, it may be as high as one in 10.
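
    The division, for the record, assuming the roughly 6.9 billion world population of 2010:

      books = 129_864_880
      population = 6_900_000_000   # approximate world population, 2010
      print(f"one book per {population / books:.0f} people")  # ~53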

    • Re: (Score:3, Insightful)

      by SomeJoel ( 1061138 )

      I suspect that most books have been written recently and their writers are still alive.

      And I suspect that you are full of crap.

      • I wish I had mod points for you sir.
      • 90% of all scientists who ever lived are alive today, and many books have been written by scientists.

        While the percentage may not be as high for all authors, I think it would be close.

      • Given the enormous explosion in literacy and printing press technology over the last 100 years, I would say he's probably closer than you think. Also, it's estimated that human knowledge doubles every 7 years -- that would mean a doubling of the number of things written down or published.

        What would resolve this is to discover how many books existed 100 years ago, and 50 years ago.

      • by Smauler ( 915644 )

        A surprisingly large proportion of the humans who have ever lived are actually alive now (most estimates put it at about 10%). It is _way_ easier now to publish a book than it was even 100 years ago.

        I'm not saying you're wrong about the GP's assumptions, but personally I'd guess he's right. That's just a guess though ;).

    • I suspect that most books have been written recently and their writers are still alive.

      Indeed, just yesterday I met Shakespeare. He was talking with Lewis Carroll and Douglas Adams. Unfortunately I couldn't talk to them, because Plato was just coming around the corner, arguing with Aristotle and Kant about some philosophical problem, and I would have been in their way. On the other side of the room, Mao was arguing with the evangelists about who had written the better Bible. Karl Marx didn't help Mao, beca

    • by mcgrew ( 92797 ) *

      Isaac Asimov wrote over 500 books. I don't know how many Terry Pratchett has written, but the number is in the dozens. There's Clarke, Heinlein, Niven... and those are just a few science fiction writers (yes, Asimov also wrote nonfiction and Pratchett is known mainly for fantasy). Serious authors write more than one book each.

      So your average is a little meaningless.

      • by pz ( 113803 ) on Friday August 06, 2010 @03:30PM (#33167772) Journal

        Isaac Asimov wrote over 500 books. I don't know how many Terry Pratchett has written, but the number is in the dozens. There's Clarke, Heinlein, Niven... and those are just a few science fiction writers (yes, Asimov also wrote nonfiction and Pratchett is known mainly for fantasy). Serious authors write more than one book each.

        So your average is a little meaningless.

        No, averages are very meaningful. Extremely meaningful. They are the AVERAGE (usually the mean), which means that some values will be above, and some values will be below. The idiocy comes in when people mistakenly jump to the conclusion that just because an average exists, every value must be exactly the same as the average. Or that, just because you can find extreme values far away from the average, the average is not meaningful.

        If the average states that 1 in 50 people have written a book, then, by gum, it will be easy to find plenty of people who have written zero books, somewhat fewer who have written exactly one (something below 1 in 50), much fewer who have written exactly two, even fewer who have written exactly three, etc. That does not mean that example authors with hundreds of books cannot exist, it only bounds how frequent they can be.

        Of the myriad ideas that the academic community has utterly failed to educate the general public about, the relationship between averages and distributions may top the list. One more time: just because an average exists, it does not mean that every datum has the same value as the average. As an example, just because the average male in the US is 5' 9", it does not mean that every single male is that tall, nor that you will not find ones that are shorter, taller, or even much shorter or much taller. The tallest man (according to my 20 seconds of research through Google) was 8' 11", and the shortest was 1' 10" ... does that lessen the meaningfulness or utility of the average male height? Rather the contrary: it provides important information as to the extent of the distribution of heights.

        Now, I suspect that the parent poster is trying to say that because -- by loosely founded speculation -- most authors are professional authors ("serious authors") and therefore will have more than one book to their name, the classification of people into authors and non-authors will be skewed against 1:50. I would not argue against that (in fact, I indirectly argued for it above). Nevertheless, using the utterly non-scientific sample of the books above my desk, most authors have only one book to their name, so the number isn't going to be much worse than 1:50, perhaps 1:55 or 1:60. That kind of pure, unadulterated speculation is exactly the sort I would love to see proved wrong with hard data.
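
        A toy simulation of exactly that shape of distribution (all the proportions below are invented for illustration):

          import random

          random.seed(1)

          # Invented proportions: almost everyone writes zero books, a few
          # write exactly one, and a rare prolific author writes many.
          population = []
          for _ in range(100_000):
              r = random.random()
              if r < 0.985:
                  population.append(0)
              elif r < 0.9998:
                  population.append(1)
              else:
                  population.append(random.randint(2, 50))

          mean = sum(population) / len(population)
          print(f"mean books per person: {mean:.3f}")  # near 1/50 = 0.02
          print(f"wrote nothing: {population.count(0) / len(population):.1%}")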

        • by mcgrew ( 92797 ) *

          One more time: just because an average exists, it does not mean that every datum has the same value as the average.

          That was the point I was trying to make. If there is one book written for each group of fifty people, the average would be one in fifty but the actual number of authors would be less than one in fifty people, probably far less. But as you say, there's no way of knowing how much less without actual data.

  • Qoh. 12:12 ... Of making many books there is no end.

  • The same checksum they use for UPC codes. Sum up the 10 significant digits. Then take that sum (S) up to the next multiple of ten (T). The difference T - S is the check digit.

    E.g. UPC code 54556 39824. Sum is 51. Next ten is 60. 60 - 51 = 9, so the check digit is 9. The same basic formula could work for ISBN numbers too.
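
    That scheme, as described, in a few lines of Python. Note that real UPC-A actually multiplies alternating digits by 3 before summing; this implements the simpler rule stated above:

      def check_digit_as_described(digits: str) -> int:
          """Sum the digits, round up to the next ten, take the difference."""
          s = sum(int(d) for d in digits if d.isdigit())
          return (10 - s % 10) % 10

      print(check_digit_as_described("54556 39824"))  # sum 51 -> 60 - 51 = 9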
    • ISBN check codes are designed to catch common errors from back when hand entry was common:

      a run of two digits in the wrong place (e.g. 556 instead of 566)
      a mistyped digit
      two digits swapped around by one place

      The UPC scheme doesn't catch the latter; in exchange, its check symbol is always one of the ten digits, regardless of the number of digits in the code. The ISBN-10 algorithm works mod 11, so it needs one extra check symbol (0-9 plus X). Whether this is required nowadays, given that very few ISBNs are entered by hand, is debatable.
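
      For comparison, a sketch of the standard ISBN-10 check digit computation:

        def isbn10_check_digit(nine_digits: str) -> str:
            """ISBN-10 check digit: weighted sum mod 11, weights 10 down to 2.

            Because 11 is prime, this catches any single mistyped digit and
            any swap of two different digits -- the hand-entry errors above.
            """
            total = sum(w * int(d)
                        for w, d in zip(range(10, 1, -1), nine_digits))
            check = (11 - total % 11) % 11
            return "X" if check == 10 else str(check)

        print(isbn10_check_digit("030640615"))  # '2' -> ISBN 0-306-40615-2
        print(isbn10_check_digit("036040615"))  # swapped digits -> '7', caught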

  • by Anonymous Coward

    129,864,880 different books? What is that in Libraries of Congress?

  • by andrewagill ( 700624 ) on Friday August 06, 2010 @02:36PM (#33166920) Homepage
    How about the books that people write and spread around to friends or books published by small in-house printshops, often as promotional material? Books written before ISBN that are still in libraries but no longer published (Bodoni's type specimens come to mind, though it looks like some of these are indeed catalogued by WorldCat)? Books that were printed years ago that we know we lost to the ages (the lost Gospel of Barnabas--not the forged Gospel of Barnabas--comes to mind). What about the books that we never knew existed?

    This estimate isn't bad for published works, but it does not adequately answer the question posed: "Just how many books are out there?"
  • by bcrowell ( 177657 ) on Friday August 06, 2010 @04:17PM (#33168536) Homepage

    ISBNs suck as identifiers for digital books, especially digital books that are free. There are two problems.

    Problem number one is that they cost money. Let's say someone writes up a really nice manual documenting some open-source software. He wants the manual to be free, just like the software. But now if he wants an ISBN, he has to pay money to get the ISBN, which means expending dollars on a book that is not going to be bringing in any dollars. The fact that ISBNs cost money is out of step with the fact that we have this thing called the World Wide Web, which is basically a huge machine for letting people do publishing without the per-copy costs that are associated with print publishing.

    The other problem is that ISBNs are supposed to uniquely identify an edition of the book. This makes sense for traditional print publishing, where the economics of production forced people to make discrete editions widely spaced in time. It makes no sense for print on demand or for pure digital publishing. I've written some CC-licensed textbooks. When someone emails me to let me know about a typo or a factual error, I fix it right away in the digital version, and I usually update the print-on-demand version within about 6 months. No way am I going to assign a different ISBN every 6 months.

    We can say that ISBNs are for printed books, not for ephemeral web pages, but that doesn't really work. The two overlap. My textbooks exist simultaneously as web pages, pdf files, and printed books. Amazon sells a book for the Kindle using one ISBN, assigning a different ISBN to the printed version. Print-on-demand books share some characteristics with printed books (e.g., they're physical objects) and some with the web (they can be updated continuously).

    By the way, why do you think library catalogs don't show ISBNs? It's because ISBNs are meant as commercial tools, like the barcode on a box of cereal. If Google finds ISBNs useful for purposes other than selling copies of books, it's probably because Google is trying to deal with a massive number of books using a minimum amount of human labor.

  • OK, I'm a bad little slashdotter, I actually RTFA. I noticed a few things:

    1) TFA actually acknowledges that the ISBN is very North America-centric, but the other cataloging types are also either N.A.-centric or at least western-world-centric.
    2) The entire article is based on efforts to simply compile a list of books by aggregating and loosely filtering/sorting several other lists. The lists mentioned are, as far as I know, all heavily biased toward 19th and 20th century works. (The article explicitly menti

  • and there are even more in Lucien's library in the Dreaming.
