Conservative skewing in Google N-gram frequencies
Posted by schrisomalis on July 14, 2013
Google Ngram Viewer is a great tool, especially for rough-and-ready searching and visualization of linguistic trends, and as a teaching tool to introduce students to lots of interesting questions we can ask about language variation and patterning. I use it all the time. The default search parameters are for 1800 – 2000, and the Culturomics project notes that, “the best data is the data for English between 1800 and 2000. Before 1800, there aren’t enough books to reliably quantify many of the queries that first come to mind; after 2000, the corpus composition undergoes subtle changes around the time of the inception of the Google Books project.” Elsewhere, the Culturomics FAQ notes that, “Before 2000, most of the books in Google Books come from library holdings. But when the Google Books project started back in 2004, Google started receiving lots of books from publishers. This dramatically affects the composition of the corpus in recent years and is why our paper doesn’t use any data from after 2000.”
OK, so we’ve been warned that the data from before 2000 is very different from the data from after 2000, and especially that 2004 marked a significant change in the corpus. Caveat lector, or whatever you will. But I want to know: in what ways have these ‘subtle changes’ altered the Google N-gram corpus, and therefore, what biases in word frequencies do scholars of language need to account for?
Lately, I’ve had some interest in post-2000 changes in word frequencies for my fall Lexiculture class project, and so I’ve been looking at N-gram data going up to 2008 (the last date you can search). I have found some very weird declines in words that probably aren’t actually declining in relative frequency:
It seems notable that all of these words start to decline shortly after 2000, with a particularly steep decline right around 2004–05. All of these words, I would argue, should be stable or increasing in frequency: these are words associated with modern technology and social life. Conversely, many timeless words (e.g., table, lamp, daughter) are flat or rising after 2000. It’s possible that intuitions about what should be happening to words can be wrong. But why are they all wrong in the same direction, and why do they all decline at the same time?
- One possibility is that the data from 2000 onward aren’t complete yet. There could be some books published over the past few years that haven’t been integrated into Google Books and thus don’t end up in the Ngram viewer. But in any case, n-grams measure a word’s frequency relative to all words published in that year, so the fact that the collection isn’t complete should not affect relative word frequencies at all.
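To see why sheer incompleteness shouldn’t matter, here is a toy sketch in Python, with invented numbers: if scanning misses books uniformly at random, a word’s count and the year’s total token count shrink by the same factor, so the relative frequency comes out the same.

```python
# Toy illustration with invented numbers: an n-gram frequency is a word's
# count divided by the total tokens published that year. If books are
# missing uniformly at random, both quantities shrink by the same factor.

def rel_freq(word_count, total_tokens):
    """Relative frequency of a word in one year's corpus."""
    return word_count / total_tokens

full_count, full_total = 1_000, 10_000_000   # hypothetical complete corpus
scan_rate = 0.6                              # only 60% of books scanned

partial_count = full_count * scan_rate
partial_total = full_total * scan_rate

# The relative frequency is (up to rounding) identical either way.
assert abs(rel_freq(full_count, full_total)
           - rel_freq(partial_count, partial_total)) < 1e-12
```

The invariance only holds if the missing books are a random sample; a *biased* gap in the collection is a different matter, which is where the next possibilities come in.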
- It’s possible that Google Books has systematically missed archiving books oriented towards technology, but why would that be the case? In fact, if tech-savvy publishers are more likely than the average publisher to submit their works to Google Books (which I think is plausible), the effect should be to increase these words’ frequency.
- It’s possible that, in the absence of the controlled digitization of books from libraries that characterized the early period of Google Books digitization, and the work done to manage metadata in creating the N-gram Viewer’s early dataset, massive error has crept into the database. But again, why would such error depress specifically modern words while leaving timeless words untouched?
I think I have a better answer. I think that the N-gram Viewer may be skewed not because anything significant is being missed, but because something significant is being added. There is a growing tendency for cheap electronic reprints of public-domain books to be published and immediately included in Google Books, with the publication date listed as the date of the electronic reprinting. If Levi Leonard Conant’s book The Number Concept (1896) is scanned and reprinted by Echo Books in 2007, the Google Books metadata doesn’t recognize the reprint as an 1896 book at all. The same text thus enters the corpus twice: once (correctly) as an 1896 book and again as a 2007 book. In fact, because it’s in the public domain, I could make my own e-book version for sale as a 2013 book and have it listed yet again. And while that alone is not likely to have a huge effect, imagine every reprint of A Tale of Two Cities or Wuthering Heights that has flooded the market since the invention of e-books, stimulated by and reinforced by projects like Google Books.
Now, I suppose there is a case to be made that the 2007 reprint of Conant is, in some way, a 2007 book. After all, reprints have never been excluded from Google Books, and there are plenty of pre-electronic 20th-century reprints of Wuthering Heights in the corpus. But each of those earlier reprints represents a costly decision by a publisher that a particular book is important enough, and will be read widely enough, to warrant its republication. From a ‘culturomics’ perspective, there’s a case to be made that each such reprint really constitutes a cultural ‘signal’ in the year of its reprinting, and from a linguistic perspective, we presume that lots of readers will read its words, even if they are obsolescent at the time. But as the cost of producing reprints as e-books (or print-on-demand) declines, the ‘culturomic’ value of these books also declines, because publishers no longer need to be concerned about whether many (or even any) people buy them. The author is long dead, so there are no royalties, and there are minimal or no up-front publishing costs. So Google Books is now being flooded with material that may be largely unread and does not reflect the linguistic or cultural values of the time. Its primary effect, for the N-gram Viewer, is to skew relative word frequencies in a way that makes 2013 resemble 1913 more than it actually does. That’s a conservative bias, for those following along at home.
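The arithmetic of this skew is easy to sketch. Here is a hypothetical Python illustration, with all counts invented for the sake of the argument: mixing metadata-misdated reprint text into a year’s corpus depresses the relative frequency of modern words and inflates that of obsolescent ones.

```python
# Hypothetical illustration of the reprint skew. Assume a "true" 2007
# corpus and a batch of public-domain reprints whose text reflects
# ~1900-era usage but whose metadata says 2007. All counts are invented.

def rel_freq(counts, word):
    """Relative frequency of one word within a corpus of token counts."""
    return counts[word] / sum(counts.values())

true_2007     = {"blog": 500, "negro": 5,   "other": 999_495}  # modern usage
reprints_1900 = {"blog": 0,   "negro": 400, "other": 599_600}  # 1900-era text

# The N-gram corpus sees the sum, with everything dated 2007.
mixed = {w: true_2007[w] + reprints_1900[w] for w in true_2007}

# "blog" now appears rarer than it really was in 2007...
assert rel_freq(mixed, "blog") < rel_freq(true_2007, "blog")
# ...while "negro" appears more common: the conservative bias.
assert rel_freq(mixed, "negro") > rel_freq(true_2007, "negro")
```

The direction of the bias doesn’t depend on the particular numbers: any admixture of century-old text pulls words it underuses down and words it overuses up, relative to their true 2007 frequencies.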
We can then derive a couple of corollaries to check whether this theory is correct:
- There are likely to be some words that, while still increasing in frequency, do not increase quite as much as their actual use would indicate. These are words that have shot up out of nowhere over the past few years and are continuing to accelerate, but whose N-grams show a tapering off. A great example is a word like transgender, which shows, right around 2004, a clear decline in the acceleration of its frequency, counter to expectations.
- If some word frequencies are artificially depressed, some other word frequencies must be artificially inflated. But which ones? There are likely to be other words that were very common in the 19th and early 20th centuries (the period where most of these reprints are going to come from), but have been on the decline for a long time and are now quite rare, that show an apparent ‘rejuvenation’ after 2004. Again, we find such a word: negro (uncapitalized), which is virtually non-existent in contemporary written English but was at its peak in the period from 1880-1920, and which shows a clear ‘bump’ after 2004 which can’t possibly be real. You can even see this to a lesser degree with a word like honesty, which (for reasons perhaps best left unanalyzed) had been in decline throughout the 20th century but experiences a bump, again, right around 2004.
In summary, because the Google Books corpus today is derived largely from publisher submissions, and because there is a major signal coming from reprints of public-domain books published before 1923, n-grams from 2004 onward (and, to a lesser degree, from 2000–2004) are skewed to make modern words appear less frequent than they actually are, and obsolescent words more common than they are. The moral is not that Google is evil or conservative, or that culturomics is stupid, or that the N-gram Viewer is fatally flawed. I do think, nonetheless, that we ought to be aware that the specific kinds of unintentional skewing being produced are ones that tend, in a conservative direction, to replicate the linguistic and cultural values of a century ago. This problem is not going away, absent a systematic effort to eliminate reprints from some future N-gram dataset, and it may even be getting worse as electronic reprints become more and more common. Stick to the pre-2000 data, though, just as they advise, and you’ll be in good shape.
Thanks to Julia Pope for her consultation and assistance on aspects of Google Books metadata and cataloging practices.