Google Books Makes a Word Cloud of Human History
An anonymous reader writes
"From Ed Yong at the Not Exactly Rocket Science blog: 'Just as petrified fossils tell us about the evolution of life on earth, the words written in books narrate the history of humanity. The words tell a story, not just through the sentences they form, but in how often they occur. Uncovering those tales isn't easy — you'd need to convert books into a digital format so that their text can be analyzed and compared. And you'd need to do that for millions of books. Fortunately, that's exactly what Google have been doing since 2004.' Yong goes on to explain that the astounding record of human culture found in Google Books offers new research paths to social scientists, linguists, and humanities scholars. Some of the early findings (abstract), based on an analysis of 5 million books containing 500 billion words: English is still adding words at a breathtaking pace; grammar is evolving and often becoming more regular; we're forgetting our history more quickly; and celebrities are younger than they used to be. You can also play with the Google Books search tool yourself. For example, here's a neat comparison of how often the words Britannica and Wikipedia have appeared."
OCR errors (Score:5, Interesting)
AFAIK, Google Books doesn't do the sort of methodical OCR clean-up that Project Gutenberg does, so a lot of Google's digitized books have a fair number of errors. It'd be funny to see what kind of blips this might create in our extracted cultural history!
Re: (Score:3)
From Google's "about" page [googlelabs.com] for their Books Ngram Viewer lab: "Why does the word 'Internet' occur before 1950?"
Re: (Score:3)
A Simpsons quote where Lenny, as a kid, talks about the netting in his shorts, the internet, and later says "I think I just logged onto the internet" comes to mind...
Re: (Score:1)
One of the sample plots in the article is a plot comparing the frequencies of George Washington, Thomas Jefferson, and Abraham Lincoln. If you look at the plot, you'll notice that Lincoln has a nice uptick in name usage about 10 years before he was born.
Re: (Score:3)
A little more digging finds this little gem [google.com]. Which appears to just be mis-dated. I suspect it was written in 1890 from looking very carefully at the copyright page.
It's also very possible that some of those references are to other people with the same name. Like this one [google.com] and this one [google.com].
Re: (Score:1)
I wonder if it had to do with President Lincoln's grandfather? He had the same name and was a captain in the revolution.
Re: (Score:3)
Yes, here's an amazingly prescient book from 1920: 101 Successful Businesses You can Start on the Internet [google.com]
Case sensitive? (Score:5, Informative)
Interesting that it is case sensitive. Searching for "britannica,wikipedia" in lowercase produces, for today, close to zero for britannica and 0.00005% for wikipedia, which is not far off the result for Wikipedia (with a capital).
Putting these together, the case-insensitive comparison already has wikipedia well ahead of britannica: around 0.00010% for britannica vs. 0.00013% for wikipedia.
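For anyone who wants to do the "putting these together" step by hand, here is a minimal sketch. It assumes you have already pulled per-year percentages out of the viewer yourself; the function name and the 2008 numbers are placeholders I made up to roughly match the figures above, not real Ngram Viewer output.

```python
def combine_case_variants(series_by_form):
    """Sum per-year frequencies of forms that differ only by letter case.

    series_by_form maps a spelling (e.g. "Wikipedia") to {year: percent}.
    Returns a dict keyed by the lowercased spelling.
    """
    combined = {}
    for form, by_year in series_by_form.items():
        bucket = combined.setdefault(form.lower(), {})
        for year, freq in by_year.items():
            bucket[year] = bucket.get(year, 0.0) + freq
    return combined


# Placeholder values (percent of all words), chosen for illustration only.
example = {
    "Britannica": {2008: 0.00010},
    "britannica": {2008: 0.0},
    "Wikipedia": {2008: 0.00008},
    "wikipedia": {2008: 0.00005},
}

totals = combine_case_variants(example)
for word in sorted(totals):
    print(word, totals[word])
```

Summing is only fair if every variant is measured against the same yearly total, which is what a percentage-of-all-words figure gives you.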
Re: (Score:2)
You should try republic vs tyranny [googlelabs.com]. Some odd correlation there.
Re: (Score:1)
I read that as republic vs. tranny. The funny thing is, I probably wouldn't have clicked if I read it correctly.
Re: (Score:3)
Freedom has always been popular, but since the early 19th c. it's gone much better with "democracy" than with "republic". [googlelabs.com]
Re: (Score:2)
War [googlelabs.com] is pretty straightforward...
Re: (Score:2)
Also, everything is a crisis [googlelabs.com] nowadays!
Re: (Score:2)
Ok, I must not know something about the phrase war pigs [googlelabs.com]
Re: (Score:2)
If you click on the links at the bottom, some of them show multi-word combos like:
"Before the war, pig-iron"
"On board vessels of war pig-iron"
Re: (Score:3)
We might also note that big peak in the incidence of "Britannica" in the early 1800s. But back then, it was still expected that educated people (at least in Europe) would study Latin, and "Britannica" is merely a Latin adjectival form of "Britannia", or "Britain", and the British Empire was rather active around the world at that time. So most of the uses of "Britannica" around then probably had nothing to do with the encyclopedia.
I'd guess that you'd also find a fair number of occurrences of "Britannica"
"Britannica" in the 1500s: more than just a book (Score:2)
The word "Britannica" doesn't just refer to Encyclopaedia Britannica. It means 'of Britain' (Latin scholars can help me with the exact meaning, but this is its general sense). So you'll get hits from before the encyclopaedia existed, back to at least 1500 according to the search tool. And some hits from after the books started won't refer to them. It's a poor choice of comparison for a search.
Slashdot circa 1885 (Score:5, Funny)
Sometime around 1885, the very first Anonymouse Cowarde briefly tried writing about Slashdot, but apparently died off before his comments could be modded up.
Re: (Score:2)
That is very, very odd. It appears to be in 1899:
http://ngrams.googlelabs.com/graph?content=slashdot&year_start=1800&year_end=1960&corpus=0&smoothing=0 [googlelabs.com]
but a further search turns up zero results. If it were an OCR-o, it should at least show up.
There is another hit, labeled 1963:
http://books.google.com/books?id=x-O2AAAAIAAJ&q=%22slashdot%22&dq=%22slashdot%22&hl=en&ei=9bwLTaTADoet8Abf1qT7DQ&sa=X&oi=book_result&ct=result&resnum=1&ved=0CCQQ6AEwAA [google.com]
but it's a bad
Re: (Score:2)
The Economist dates to 1843.
There may have been some format change that makes 1963 special, or it may be that their records start there. But I doubt that's the only mention of Slashdot on the Economist, so I suspect it's just that one issue that's misdated.
(A search at The Economist turns up two hits, both from 1999, but from different issues. I'm surprised that there isn't something more recent than that, and I suspect their search is flaky. Neither one is the article that the Google search turned up, w
Re: (Score:1)
There are some number of modern works that are for some reason cataloged at the turn of the last century. Try Internet for similar results.
So it's just like Google search then? (Score:2)
Search for Slashdotte: 415 results. Go to page 9 of the results: Now there's only 89 results.
Re: (Score:2)
I followed your link, replaced "slashdot" with "LOL" and realised that we are living in the happiest of times. The same search for "pwned" shows how writers must have been tremendously aggressive in the late 1800s.
Probably only one answer (Score:2)
Smoothing creates bias (Score:2)
Note that in the linked Britannica / Wikipedia chart, Britannica appears higher because of the smoothing setting. Set it to a lower value, which gives a less pretty but more accurate chart, and Wikipedia is much higher by the present day.
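To see why smoothing holds back a word that only took off recently, here is a rough sketch. It assumes the smoothing parameter is a plain centered moving average (smoothing=n averages each year with up to n years on either side, using whatever is available at the ends), which is my reading of how the viewer describes it, not something I have verified against their code. The series is invented.

```python
def smooth(values, n):
    """Centered moving average over up to n points on each side.

    n = 0 returns the raw values unchanged. Near the ends of the list the
    window simply shrinks to the points that exist.
    """
    out = []
    for i in range(len(values)):
        lo = max(0, i - n)
        hi = min(len(values), i + n + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


# Invented yearly values: a word that is absent, then jumps in the last year.
late_riser = [0, 0, 0, 0, 0, 0, 0, 8]
print(smooth(late_riser, 0))
print(smooth(late_riser, 3))
```

With n = 3 the final year is averaged with the three empty years before it, so a sharp recent rise looks much smaller than it does in the raw data.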
Fuck's Great Comeback (Score:2)
Up until the 1820s, Fuck was apparently very much in vogue. Not until the 1960s was this great word brought back into the lexicon of the common man.
Re: (Score:2)
http://ngrams.googlelabs.com/graph?content=fuck&year_start=1800&year_end=2008&corpus=0&smoothing=3 [googlelabs.com]
Up until the 1820s, Fuck was apparently very much in vogue. Not until the 1960s was this great word brought back into the lexicon of the common man.
Click on the time period from 1800 in the lower left and you'll see search results with some of the context. Oftentimes it seems to unfortunately be an OCR error (lambs and calves fucking milk). Maybe there was a font in use at the time
Re: (Score:2)
Maybe there was a font in use at the time with an f that resemble(d/s) an s...
Exactly. Well, almost. Not so much a font, but a convention where an initial 's' (or all but the final 's') used a character that looked something like an 'f' and a little like an integral sign (or 'fign'). A lot of old documents use that. I have a 200-year-old chemistry text (handed down from a great^n grandfather) which proclaims itself "A Complete Courfe in Chymiftry", except that the 'f' isn't quite.
Re: (Score:1)
Here, take a broader look.
People may complain about filthy language these days, but daaaaaamn! Our founding fathers must have had -filthy- mouths, and I'd -really- like to know what that spike in the late 1500s was about.
Re: (Score:2)
That's rather easy, click on the link at the bottom for 1500-1665 [google.com]. A lot of OCR errors, it looks like.
Re: (Score:1)
Yahoo was also very popular back in the early 1800s and fell out of favor just the same.
Re: (Score:2)
It's actually a medial s character, rather than an f. At some point the medial s was gotten rid of in favor of the final s.
Email's Great Comeback too! (Score:2)
Not until the 1960s was this great word brought back into the lexicon of the common man.
Oddly enough, email was a pretty popular word up until the 1960s, peaking in popularity in the 1860s [googlelabs.com], but has made a comeback since the mid-1990s!
Re:Fuck's Great Comeback (Score:4, Informative)
Most of the actual hits there appear to be OCR-os for the word "suck" and "such", often due to the use of medial "s" that resembles an "f". The word "such" appeared on a page which was badly speckled.
Given that the word "suck" was often used in the expression "to give suck", many of those pages are quite hilarious ("she would not suffer the strange lamb to fuck"). I didn't see any actual "fucks" in the first few pages of hits.
I know that the word was known. Shakespeare made a sly reference to it in Merry Wives of Windsor. But I suspect it wasn't often set down on paper, at least not in the kinds of books that got preserved.
Re: (Score:3)
Which means, incidentally, that the trailing off of "fuck" at the beginning of the 19th century IS very interesting, for a different reason. It's watching the tail end of the use of the medial "s".
That's the kind of data that would have been really hard to gather any other way, unless the OCR were to distinguish between medial "s" and regular "s" in its results. There IS a Unicode for medial S, but most OCR doesn't go there.
So, we have a proxy for it: "suck" scanned as "fuck", which wouldn't otherwise app
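A small sketch of the proxy idea, under loud assumptions: the helper names are mine, and the rule "every lowercase s except the last letter of the word was printed long and gets read as f" is a simplification of the convention described upthread, not a real OCR model. The one hard fact in it is that the long s does have its own code point, U+017F.

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S


def fold_long_s(text):
    """Replace the long s with an ordinary s, for text that preserved it."""
    return text.replace(LONG_S, "s")


def likely_ocr_misreading(word):
    """Guess how OCR might render a word set with long s.

    Simplified assumption: every lowercase 's' except the word's last
    letter was printed long and is misread as 'f'.
    """
    if not word:
        return word
    return word[:-1].replace("s", "f") + word[-1]


for modern in ["suck", "such", "Course", "Chymistry", "lambs"]:
    print(modern, "->", likely_ocr_misreading(modern))

print(fold_long_s("Chymi" + LONG_S + "try"))
```

Feeding the misread forms into the viewer and watching where they die out gives the same kind of proxy for the end of the long s as the "suck"/"fuck" curve.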
wasting time (Score:2)
now I spent almost an hour fooling around with this today
Naughty Words (Score:2)
Inverse correlations (Score:2)
Re: (Score:1)
My first search was between demon and epilepsy.
http://ngrams.googlelabs.com/graph?content=demon%2Cepilepsy&year_start=1800&year_end=2008&corpus=0&smoothing=3 [googlelabs.com]
The dip in the 30s was quite interesting to me.
Lies, Damned Lies and Statistics (Score:1)
Hmmm... So Britannica still on top?
But this link (is with smoothing=0) gives a different result:
http://ngrams.googlelabs.com/graph?content=Britannica%2CWikipedia&year_start=1800&year_end=2008&corpus=0&smoothing=0 [googlelabs.com]
Not that I know whether smoothing=0 is better or worse than smoothing=3
Kind regards,
Roel
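The links in this thread differ only in their query parameters, so flipping between smoothing values is easy to script. A minimal sketch, assuming the parameter names seen in the URLs posted here (content, year_start, year_end, corpus, smoothing); the function name is mine, and whether the host still answers is not something this code checks.

```python
from urllib.parse import urlencode

BASE = "http://ngrams.googlelabs.com/graph"  # host as it appears in the thread


def ngram_url(words, year_start, year_end, corpus=0, smoothing=3):
    """Build a viewer link like the ones posted in this thread."""
    query = urlencode({
        "content": ",".join(words),  # the comma is percent-encoded as %2C
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,
    })
    return BASE + "?" + query


for s in (0, 3):
    print(ngram_url(["Britannica", "Wikipedia"], 1800, 2008, smoothing=s))
```

Opening the two printed links side by side shows the raw and smoothed charts for the same query.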
A bit sparse of an article (Score:5, Interesting)
I wish the article had gone into more depth about grammar changes, rather than just word forms. For example, sentence ordering, comma usage, and various other grammar items would be more intriguing. I found burnt/burned the most interesting comparison because it showed two competing versions of a word.
Interesting idea, but as was stated in the article, there are definite limits to what this technique can study, and many are unconvinced of its value for more than highly limited problems.
War&Peace (Score:1)
http://ngrams.googlelabs.com/graph?content=doctor%2Clawyer%2Carchitect%2Csoldier%2Cpoliceman%2Cdog+walker&year_start=1500&year_end=2000&corpus=0&smoothing=3 [googlelabs.com]
http://ngrams.googlelabs.com/graph?content=peace%2Cwar%2Cmoney&year_start=1500&year_end=2000&corpus=5&smoothing=3 [googlelabs.com]
Re: (Score:2)
mechanical vs electronic vs electric vs magnetic vs hydraulic vs pneumatic [googlelabs.com]
library vs internet [googlelabs.com]
Books before 1905 with "Internet" in them [google.com]
book vs computer [googlelabs.com]
horse vs carriage vs car vs aircraft (notice the "noise level" for car in the mid-1800s and earlier) [googlelabs.com]
terrorism and terrorist (change start to 1900 for a closer view) [googlelabs.com]
This is better than discovering new oil reserves (Score:1)
The richest data mine in the whole world... and probably bottomless..
I call BS (Score:1)
On the other hand, they seem to have pegged this one [googlelabs.com]!
Re: (Score:2)
Perhaps if you used the correct [googlelabs.com] bigrams [googlelabs.com] instead of uncommon contractions of them.
From TFA (Score:2)
Rather than expose the full texts to the public (and themselves to copyright infringement)
But wait, I thought you were breaking the law just by scanning the books and creating unauthorized copies. Or is there a different law for corporations like Google?
Re: (Score:2)
it doesn't matter, it's retarded either way
we can't actually READ these texts... drum roll please... that in most cases no one can get their hands on anyways, they are so obscure. because someone might lose money, theoretically, THAT THEY ALREADY AREN'T MAKING. however, if these texts were made freely available, there would be renewed interest in some of these obscure works and someone would definitely make ancillary revenues off of them
google is providing free exposure for rights holders and grandchildren
John Lennon (Score:2)
thought this was more interesting than the summary's example:
http://ngrams.googlelabs.com/graph?content=peace%2C+love%2C+understanding&year_start=1800&year_end=2008&corpus=0&smoothing=3 [googlelabs.com]
Hippies (Score:1)
Google VS Yahoo (Score:2)
Google Vs Yahoo [googlelabs.com]
"the" vs "of" is also exciting......I will be following this contest for the rest of my life.
The vs Of [googlelabs.com]
Is another word more common than "the"?
tl;dr (Score:3)
The Cola Wars (Score:1)
Where's Buffy! (Score:3)
Leadership (Score:1)
This truly looks phenomenal (Score:1)
Easter Egg (Score:2)
Re: (Score:2)
This is more concerning:
http://ngrams.googlelabs.com/graph?content=pirates%2Cglobal+warming&year_start=1800&year_end=2008&corpus=0&smoothing=3 [googlelabs.com]
More pirates should mean less warming!
New York Word Exchange (Score:2)
Seeing the graphs of word popularity over time reminds me of that old Saturday Night Live skit [jt.org] with Phil Hartman giving word investing tips.
Rickrolled easter egg (Score:4, Funny)
Another easter egg (Score:1)
Party Graphology (Score:2)
It's said that liberals have issues and conservatives have principles. Plug "issue,principle" into it and see a really good picture of Western political change.
Global Temperatures (Score:1)
Google Books vs. real corpora (Score:4, Informative)
Corpus of Historical American English.
-- 400 million words, 1810s-2000s.
-- Allows for many types of searches that Google Books can't:
* accurate frequency of words and phrases by decade and year
* changes in word forms (via wildcard searches)
* grammatical changes (because corpus is "tagged" for part of speech)
* changes in meaning (via collocates; "nearby words")
* show all words that are more common in one set of decades than others
* integrate synonyms and customized word lists into queries
* etc etc etc
-- Funded by the National Endowment for the Humanities (NEH), 2009-2011.
Take a look at the "Compare to Google/Archives" link off the first page.
Re: (Score:2)
Your corpus is clean and balanced -- Google's is 1200 times bigger.
Your front-end is powerful but complicated -- Google's is simple and usable by regular people.
Your front-end can handle the load of a few academics -- Google's can handle getting slashdotted, in the mainstream press, etc.
I kind of see them as complementary. If you'd like, I could get you in contact with the folks who made the Google system; they'd probably be open to someone working to bring more structure to it, or just hosting fancier-but
Communism vs. Terrorism (Score:1)
Yikes! (Score:2)
The sciences (Score:2)
http://ngrams.googlelabs.com/graph?content=physics%2Cchemistry%2Cbiology%2Cmathematics&year_start=1700&year_end=2008&corpus=0&smoothing=6 [googlelabs.com]
In memory of George Carlin (Score:2)
I wonder why the (ever so slight) drop of "euphemism" near the present bothers me... [googlelabs.com]
Cardinal Directions (Score:2)
http://ngrams.googlelabs.com/graph?content=North,South,East,West&year_start=1700&year_end=2008&corpus=0&smoothing=3
The directions "North" and "South" were more than an order of magnitude more popular than "East" and "West" until ~1800, when they quickly caught up over the course of a decade or so. Perhaps this is due to the American revolution, but I noticed that lower-case versions of all four words didn't become popular until about the same time, as well.
Neat...but to put human history in perspective go (Score:2)
Obligatory (Score:2)
I don't know how this was missed earlier, but:
http://ngrams.googlelabs.com/graph?content=sharks,lasers&year_start=1770&year_end=2008&corpus=0&smoothing=3 [googlelabs.com]
American vs British (Score:1)
Comparing corpora is much more interesting (Score:2)
You might compare English to German for example and have a look at what it looks like around the world wars.
Careful on translations though. Few words are direct translations meaning exactly the same.
Most arguments are based on people having different meanings assigned to words in their head and not realising actually.
At the moment I have to do this by making the PNG transparent and overlaying. I'd love to know how to do it automatically. It's fascinating to see different languages reacting differently to wor
Ninjas vs Pirates (Score:1)
Bah, uncultured barbarians. (Score:2)
"Britannica" can refer to things other than said encyclopaedia. This [googlelabs.com] gives a different picture.
Re: (Score:3, Insightful)
Because if you have a time machine, I've got some business plans that could make us both filthy rich...
Re: (Score:3)
Oh yeah, the only thing that ever matters is when a self-selected sample of writers puts words on paper. Nothing else matters.
I don't know that anyone besides yourself actually made that claim...
What is the percentage of humans who have lived? And what percentage of those humans got book deals
If we're talking about human history here, not many published authors actually had to get book deals. Those are a fairly recent occurrence.
and successfully negotiated the minefield to get not only published, but indexed by a 15-year-old company?
Google is indexing everything they can get their hands on. It isn't like you have to pay an entrance fee or anything.
Surely this is the sum of all human knowledge! How could it be otherwise? Oh, no, my anti-intellectualism is showing! How dare I question my betters?
The fact of the matter is that the important stuff is usually what gets written down.
Genealogies, religious texts, laws, business records, etc.
And even if it's fiction, it's generally a
Re: (Score:3)
History isn't what really happened, it's what got written down. Everything else is evanescent (well, except for what archaeologists can dig up and reconstruct, which isn't much and not necessarily accurate -- and it only counts if they write it down). Mind, I'd be more impressed if Google were also tracking the content of every hieroglyph and cuneiform tablet ever found.
It will ever be thus, unless someone invents a time machine (or at least a time viewer).
Re: (Score:2)
Mind, I'd be more impressed if Google were also tracking the content of every hieroglyph and cuneiform tablet ever found.
It will ever be thus, unless someone invents a time machine (or at least a time viewer).
I suspect they plan to...
Re: (Score:1)
I'm going to start appending that to the end of my posts in a futile, silly attempt to defend ridiculous, unfounded assertions I make. Oh, no, my anti-intellectualism is showing! How dare I question my betters?