Well, with regard to the study of California language diversity I talked about a few days ago, my students rightly think that using contemporary satellite images of California vegetation overlaid with potentially-unreliable  colonial-era ethnolinguistic data is probably not a good way to figure out why people 12,000 or 8,000 or 1,000 years ago moved where they did.  And I haven’t even taught them anything about the perils of glottochronology yet.    Also worth noting: no linguists were involved in the writing or evaluation of that paper at any stage, as far as I can tell.

So for those of you following along at home, on Thursday in class we’re going to be tackling yet another rather dubious piece of scholarship (and scholarly reporting) from last month: Patricia Greenfield’s research using the Google Ngram Viewer to study trends in personality in British and American societies as expressed through word frequencies. The study is ‘The Changing Psychology of Culture from 1800 to 2000’ from Psychological Science, and the news article is “Language in books shows how we have grown more selfish” from the Telegraph. Advance feedback in comments is welcome.

Eellogofusciouhipoppokunurious, and other monstrosities

Posted by schrisomalis on September 1, 2013

Over at my obscure words website, The Phrontistery, there’s been a word that has been the subject of many astonished inquiries over the years: eellogofusciouhipoppokunurious, which means simply ‘good’.   At 30 letters, it’s the longest word on a site that’s full of them. More to the point, because my site is one of the most prominent places you can find the word, and because it doesn’t appear in any standard dictionaries (including the mammoth Oxford English Dictionary), over the years, I have had many people write to ask whether it is in fact a real word at all.

So to try to answer this question, first let me tell you about how eellogofusciouhipoppokunurious came to be in my list. Back at the dawn of the Internet (well, OK, more like 1996), I had no idea that my word list would still be around (and over twice as long) seventeen years later. Nowadays, I wouldn’t rely on a source like this for adding words to my list, but I was less picky back then, and I took words wherever I found them. Combing my old email (I admit it – I have a complete record of all my emails going back to 1995. But questions like this are why I hoard them), I discovered that I found the word in something called the Slang Teasers Dictionary, vol. II. Slang Teasers is a game like Balderdash except instead of cards with words on them, there is a little silver paperback dictionary from which you pick words. It was published in Canada in 1985, and as far as I can tell is generally forgotten today, although you can still buy the book used for $50 if the mood strikes you. Anyway, I still have the book, and there is eellogofusciouhipoppokunurious, defined as ‘very good; very fine’.

But where did the creators of the game find it? Whenever I’ve been asked, I haven’t really had a satisfactory answer. I’ve told various people that I suspected it to be a nonce-word – that is, a word created ‘for the nonce’, to solve a one-time need in communication, without any expectation that it will become standardized or widely accepted. The fact that its first six letters, read backwards, spell gollee (whereas eellog is very odd according to the patterns of English orthography) suggests that its inventor may not have been entirely serious. Nonce-words can end up becoming used more widely – that’s often why they end up in dictionaries – but they start out in a single specific context and aren’t expected by their creators to go any further. There are tons of nonce-words created to mean ‘good’; if I say to you, “Wow, this donut is superfantrobulous”, you’ll know what I mean, even though I just made it up. Most of them are never written down and never repeated.

A short while ago, when I created my Long Words page on the Phrontistery, I returned to the vexing subject of the origin of eellogofusciouhipoppokunurious and after a couple minutes’ searching, turned up this entry over at the wonderful Futility Closet, which at least gave me a source in an actual dictionary, Weseen’s Dictionary of American Slang, but seemed unlikely to go further.  A search on Google Books produced a couple of false positives (no, it wasn’t really used in some novel from 1845), some modern reuses (analogous to my Slang Teasers), and one from a 1934 review of Weseen’s dictionary. Nevertheless, I tracked down a copy of Weseen’s dictionary, which I just got on Thursday from my friendly interlibrary loans department, and sure enough, there it is, defined as ‘very good; very fine’ but with no further information.

So where did Weseen find it? With nothing relevant in Google Books (and in fact, nothing in several other sources I use from time to time), it seemed as if I had reached a dead end.  However, two facts gave me hope:

1) While Weseen’s dictionary is fairly simplistic (he provides only words and one-line definitions), he was no crank; he was (according to the title page), ‘Associate Professor of English, University of Nebraska’ and ‘Author of Crowell’s Dictionary of English Grammar and Handbook of American Usage; Words Confused and Misused; Write Better Business Letters; Everyday Uses of English’. His introduction is clear and well-written and emphasizes the wide range of texts and spoken contexts where slang is found, suggesting to me that he wasn’t just making stuff up.

2) The fact that the word didn’t show up in a search for eellogofusciouhipoppokunurious is not immediately fatal because there can be many opportunities to misspell such a word (one l or two near the start?  shouldn’t it be ‘hippopo’ rather than ‘hipoppo’?, etc.), and many opportunities for Google Books’ optical character recognition (OCR) to get it wrong as well.

But despite multiple attempts at respelling, nothing came up, and I was starting to get frustrated.   So I changed strategies, and decided to search for some of the other weird-looking words in Weseen’s dictionary.  I found an awful lot of them that seemed to have come from early issues of Dialect Notes, an early publication of the American Dialect Society and a predecessor to its current journal, American Speech.  Fortunately, much of this journal (at least, that part published pre-1922) is in the public domain and available online.     A substantial number of searches from Weseen’s words ended up going to articles by Louise Pound, a major American folklorist and dialectologist, the first female president of the Modern Language Association, and one of the founders of American Speech, who was … wait for it … a professor of English at the University of Nebraska, along with Weseen.  And sure enough, a little more searching turned up the elusive reference I had been looking for: Pound’s 1916 article “Word-List from Nebraska (III)”, Dialect Notes 4(3): 271-282, which is a list of slang terms she collected from her students in the early 1910s.  And lo and behold:


(Pound 1916: 274)

We can also see immediately why Google hadn’t turned it up in my searches – because it was broken up using hyphens and dots, it didn’t turn up as a whole word. (I believe that the dots are being used to indicate stress, while the hyphens are orthographic – i.e., they’re meant to be used when the word is spelled out, even though none of my later sources do so.) The source is listed as a contributor from western Oregon, but Pound also assures us in her introduction that “Unless note to other effect is made, each word on the list was known to at least six people, coming generally from different sections of the state” (Pound 1916: 271). I’m not sure whether the note is meant to suggest that in this case, only one student, from Oregon, knew the word, or whether others from Nebraska also were familiar with it.
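To make the search problem concrete, here is a minimal Python sketch of why an exact-string search misses a word that is printed with hyphens and dots, and how stripping those marks recovers it. The broken-up form used below is a made-up stand-in, not a transcription of Pound's entry, and the function name `normalize` is my own.

```python
import re

TARGET = "eellogofusciouhipoppokunurious"

def normalize(entry: str) -> str:
    """Lowercase the entry and strip everything that is not a letter a-z."""
    return re.sub(r"[^a-z]", "", entry.lower())

# Hypothetical printed form: the real placement of hyphens and dots in
# Pound (1916: 274) is NOT reproduced here; this is only an illustration.
printed = "eel-logo-fus·ciou-hip·oppo-kunu·rious"

# A whole-word search fails on the broken-up form...
print(TARGET in printed)
# ...but succeeds once the hyphens and dots are removed.
print(normalize(printed) == TARGET)
```

A full-text search engine that indexes on word boundaries will treat each hyphen- or dot-delimited piece as a separate token, which is presumably the sort of thing that happened here.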

So at this point I could have been satisfied that Weseen got it from Pound and then called it a day.  But then I found this very curious entry just two pages later:

(Pound 1916: 276)

Now, it didn’t define hypoppercanorious for me, but one doesn’t need advanced training in linguistics to see that it’s basically eellogofusciouhipoppokunurious minus the eellogo and the fusciou.    This one was known by students both from Nebraska and Massachusetts.  The entry sent me off to the previous volume of Dialect Notes, and to yet another article by Pound and to yet another word, flippercanorious, defined as ‘fine, grand’ and indicated to have been used in Nebraska.    And indeed, while hypoppercanorious is not in Weseen, flippercanorious is there, defined as ‘grand; elegant’.

So these three words all generally mean ‘good’ or ‘fine’ or ‘grand’ and presumably eellogofusciouhipoppokunurious, with all those extra morpheme-looking bits on the front end, is intensified and so means ‘very good’ or ‘very fine’. Gollee! But this set of entries also shows that at least in the early-to-mid 1910s, and possibly later, this set of related words was used by youth from coast to coast, perhaps most widely in Nebraska, presumably mostly in speech (or else they’d show up in more texts). No mere nonce-words, these seem (at least the shorter ones) to have been in some sort of regular usage among at least some youth-oriented or college-oriented speech communities for at least a little while. Slang – to be sure. Jocular – of course. But definitely ‘real’ words used more than once by more than one person. It makes me much more comfortable leaving eellogofusciouhipoppokunurious on my list, right where it’s been for the past 17 years.

After all this, there’s still one question remaining: why on earth would you need such an unwieldy synonym for ‘good’? Granting that ‘flippercanorious’ isn’t so bad, and ‘hypoppercanorious’ is at least manageable, ‘eellogofusciouhipoppokunurious’ strains the tongue and the eyes. So why bother? The word’s value lies in its very size. Compare ‘supercalifragilisticexpialidocious’, another super-long English word that, contrary to popular belief, didn’t originate with Mary Poppins at all, but, quite possibly, in a very similar context, among the youth of the 1910s, and certainly by the 1930s. Or ‘floccinaucinihilipilification’, created by students at Eton in the 18th century out of four Latin roots, to mean ‘to value as worthless’. Or ‘pneumonoultramicroscopicsilicovolcanoconiosis’, created as a joke in the 1930s by the president of the National Puzzlers’ League. All of these words were coined facetiously, have simpler synonyms, and serve as an emblem of social value for their users, pointing to themselves as clever people who know long words. Even ‘antidisestablishmentarianism’, which refers to a real 19th century political movement in Great Britain, wasn’t actually used as a word in 19th century Great Britain. The earliest reference in the OED is from Brewer’s Dictionary of Phrase and Fable (1923 edition) in the ‘Long words’ section.

In fact, the only place where any of these words normally shows up in ordinary usage today is in discussions of what is the longest English word. They each have a definition, which gives their lexical meaning, but their meaning in context and in actual use – their social meaning – is to show off the fact that one knows long words, and presumably, by extension, that one is intelligent. It is small wonder that these words are developed and used frequently by students – those who have the most to gain by claiming cultural capital associated with intellect. As a linguistic anthropologist, I want to know what words mean, of course, but I also want to know how they are used in actual social contexts. And really, what we have here is a whole category of words that, regardless of their specific meaning, are used in the same way, to impress and overawe the listener or reader with their users’ erudition. That’s pretty darn eellogofusciouhipoppokunurious indeed.

Explaining Californian language diversity

Posted by schrisomalis on August 31, 2013

A recent paper in the Proceedings of the National Academy of Sciences, summarized in a Los Angeles Times news article, argues that there is a strong correlation between the linguistic diversity and ecological diversity of various parts of prehistoric California.   Using satellite images of plant growth in different areas of the state and comparing it with known or hypothesized distributions of linguistic groups, the authors, Brian Codding and Terry Jones, argue that to understand the density of languages in some areas of the state and the relative sparseness of languages in others, ecological variables such as environmental productivity need to be taken into account.   Specifically, they argue that waves of migration to ecologically attractive areas produce dense areas of language diversity, whereas ecologically unproductive environments are less diverse.

Now, I should say right up front that I’m not convinced by this study or by the media account of it.   I’m not going to go into all the reasons here (yet), because my students and I are going to talk about this study on Tuesday.  I think it embodies some of the more serious problems with studies of the language-culture intersection, and some of the more serious problems with science reporting.   Figuring out how to ask relevant analytical questions about material like this is, I believe, a critical step in advancing not only anthropology in the media, but the science as a whole.

Up and at them!

Posted by schrisomalis on August 29, 2013

Well, here we are again at the first day of classes (for me) at Wayne State. This year my Language and Culture undergraduate class will be following and reading my blog posts here as part of our in-class discussions, and material from here will also end up on their final take-home exam. So we may see comments and questions here from some newcomers from my undergrad class, who may have no prior background in anthropology or linguistics, and any of your comments and questions may show up as discussion fodder in my classroom. Your kindness in the spirit of pedagogy is appreciated – thanks in advance.

It’s also the first week for a lot of new graduate students, both here at Wayne State and across the country, in all sorts of fields. So, in the interest of spreading the word, here’s a great article, ‘The Ten Commandments of Graduate School’, that deserves a careful read not only by students but by their mentors as well.

New evidence for Madagascar settlement history

Posted by schrisomalis on August 19, 2013

There’s a fascinating new article in PNAS, ‘Stone tools and foraging in northern Madagascar challenge Holocene extinction models’, outlining a long chronology for the settlement and early habitation of Madagascar. The traditional wisdom is that Madagascar was uninhabited until around 500 CE when Austronesian speakers from southern Borneo migrated several thousand kilometres westward, and Bantu-speaking East Africans crossed the Mozambique Channel, producing a civilization of iron-using swidden farmers and creating an ecological catastrophe in which many native species went extinct. The discovery that the Malagasy language is most closely related to the Southeast Barito languages of Borneo, proposed systematically for the first time by Otto Dahl in the 1950s, is one of the most significant and surprising findings in historical linguistics of the past century, given the enormous geographic distance between the two regions. Later, Dahl helped to establish that Malagasy also has an important Bantu linguistic substratum, and more recent genetic evidence confirms that both African and Southeast Asian migration was involved.

This new study, whose first author, Robert Dewar, unfortunately passed away before its publication, shows the situation is significantly more complex, and that there is a history of hunter-forager habitation in at least some parts of Madagascar going back up to 4,000 years (i.e. 2,500 years more than previously acknowledged by the traditional hypothesis). I’ve always wondered how it was that Madagascar, which is not that far from the East African coast, could remain entirely uninhabited by humans for so long. The new study, based on fieldwork conducted a few years ago at two rock shelters in the northern part of the country, shows a vibrant hunting-foraging adaptation with microlithic tool technology to have existed far earlier than previously suspected. This tool tradition has similarities with both East African and Middle Eastern traditions of the same period, but not with Southeast Asian ones (unsurprisingly). What this tells us is that there was a previously-unidentified pre-Bantu, pre-Austronesian population on the island, probably of East African ancestry, for millennia before the extinctions of Madagascar’s megafauna began in earnest. It requires that we rethink the model that sees the arrival of humans on Madagascar as the simple direct cause of the extinctions, and forces us to instead ask what sorts of human-environment interactions cause effects, and how.

Phrontistic revisions

Posted by schrisomalis on August 14, 2013

For those of you who may not know, I run a sister site to this blog, The Phrontistery, which in one form or another has been around since 1996, and which features an online dictionary of rare words, glossaries on various topics, and other language-related resources. While the site has been more or less dormant for a few years – mostly I’ve just been keeping the place tidy without adding any new content – I’ve had a slow(ish) summer and so took the opportunity to get things up and running smoothly there again, with a bunch of new content and a new site layout. Over the years I’ve given a lot of thought to somehow combining the two sites, e.g., by moving Glossographia over there or something, but I’ve never had the energy to figure out how difficult that would be. Let me know if you think that would be a terrible (or great) idea, in which case I don’t have to think about it any more.

Figurative is my middle name

Posted by schrisomalis on August 2, 2013

Stephen Figurative Chrisomalis.  Has a nice ring to it, don’t you think?

In seriousness, an offhand remark I made to my wife this morning, that “Compliance is my middle name,” led me on a very interesting search for the origins of the figurative use of middle name to refer to a paramount or notable characteristic of a person.  The entry for middle name in the OED has been relatively recently updated, and includes numerous instances of this figurative use going back to 1905, where the New York Journal has “For retiring you’re—well, that’s your middle name.” and other quotations going up to the present.   I did a little further searching around and was able to find an earlier one going back to 1902, in the Manitoba Free Press, quoting a correspondent from Dawson, Yukon Territory (and you’ve got to know that I love it when I antedate something and it turns out to be Canadian):

Source: Manitoba Free Press (Feb 28, 1902), p. 1

This isn’t quite enough to associate it firmly with the Klondike Gold Rush, but it’s a possibility.  I was able to find some others from the early 20th century (mostly American, but no others from Canada) and then onward from there.  But I also was intrigued when I looked at the Google Ngram for ‘is my middle name’:

Google Ngram search for 'is my middle name' and variants

That spike peaking right around 1920 is really interesting; equally interesting is that thereafter, it drops back down to relatively modest levels until the 1960s, and then takes off again, reaching its historical peak in the late 1980s and keeping right on going.    Now, it’s clear that the initial rise starts well before World War I, so this isn’t something directly associated with soldiers’ slang or the general mixing of dialects during and after the war, but looking at the Google Books results around 1920, this really seems to have been a fad at the time – most of the uses of “is [pronoun] middle name” are non-literal.  But by the time that Agatha Christie wrote, in The Murder of Roger Ackroyd (1926: 144), “‘Modesty is certainly not his middle name.’ ‘I wish you wouldn’t be so horribly American, James.’”, the fad was already on the wane – although it never disappeared entirely, for the next several decades it was quite rare.

I was a little surprised to see that there was no immediate bump related to the country anthem Sixteen Tons, first recorded in 1946, and whose rendition by Tennessee Ernie Ford in 1955 reached #1 on the Billboard charts for six weeks, but its line, “Fightin’ and trouble are my middle name” is in a middle verse and perhaps had little linguistic effect (although it was re-recorded many times throughout the 60s and beyond). The post-1965 bump could equally have been inspired by Bobby Vinton’s 1963 single, “Trouble is My Middle Name”, although it peaked only at #33, and, if I may say so, is not really very good. Regardless of the specific impetus, once it took off, it became strongly idiomatic, and today the phrase has become so well-known that it is covered in TV Tropes and elsewhere. I’m confident that my readers will regale me with their favourite examples.

How do you pronounce Detroit?

Posted by schrisomalis on July 21, 2013

As you may know, I work at Wayne State University in Detroit, Michigan.  Detroit’s been in the news a lot lately, regarding its bankruptcy and a whole lot of other things that, if I were to start talking about them at any length, would just send me off into a rage.  This would not be pretty.

So instead, let’s talk about the language of Detroit.  Detroit has two main English language varieties: first, the local variety of African American Vernacular English (AAVE), which has been studied by John Rickford, Geneva Smitherman, Roger Shuy, and others; and second, the variety of English that has undergone the Northern Cities Vowel Shift.    Here, Penny Eckert’s work is of the greatest significance, building on the work of Bill Labov, but with a specific focus on southeastern Michigan.    Most black Detroiters speak the first variety, and most white, locally-born Detroiters speak the second, with some exceptions.    Today I’m going to focus just on the second variety, but I’m not going to talk about the dialect as a whole.  Instead, I want to talk about variation, and some innovations I’ve noticed, in the speech of Detroit-area residents in the pronunciation of a single word – the place-name Detroit itself.

Now, if you’re like me, and like most North Americans, you probably pronounce Detroit something like /dɪˈtɹɔɪt/, or, for those unfamiliar with IPA, di-TROIT: two syllables, with second-syllable stress and rhyming closely with adroit (sound sample). You can hear a clear example of this pronunciation, for instance, in this commercial for the Detroit Zoo. Other examples could be found pretty readily, so I won’t belabour the point. This is the standard pronunciation of Detroit and the baseline for today’s discussion.

There is a second variant heard locally, which has primary stress on the first syllable rather than the second, and in which the first vowel becomes /i/: in other words, /ˈdiˌtɹɔɪt/ (DEE-troit). The second syllable then has secondary stress (it can’t be entirely unstressed or the vowel would have to reduce). You hear this pronunciation sporadically and without any particular association with any class or ethnic group, but it’s less common, and we’ll leave it alone.

Lately, however, I’ve been hearing a third pronunciation in a lot of commercials, local news, and the like, in which white Detroit-area residents pronounce their home city as /dɪˈtɹʌɪt/ or even /dɪˈtɹəɪt/, with the first part of the OI vowel unrounded and fronted almost to a schwa. For the non-phonologically-inclined, what seems to be happening is that di-TROIT starts to sound more like di-TRITE. Here’s a good example from a YouTube video from Detroit Real Estate Investing. If you’re not convinced, try loading up both this video and the one from the Detroit Zoo, and running them one after the other a couple of times to compare. Still not convinced? Try this video for America’s Best Value Inn, for another example. Still not convinced? Try this clip from a video made by a local man, with two very clear examples right in a row.

As far as I’ve heard, this variant is only used by people from the Detroit area who have the Northern Cities shift – i.e., it’s not used by people from Milwaukee or Cleveland or Buffalo.    It’s not, as far as I can tell, part of the standard analysis of the Northern Cities shift or the specific changes found in the Detroit area, after a lot of time poking around the sound files on Penny Eckert’s website.  I haven’t noticed this with any other words that contain the diphthong oi.   I wondered whether it might be typical of other words ending in oit or in oi followed by other voiceless stops (e.g. p, t, k).  But the problem is that very few words in English end in oit (really just exploit, which is moderately uncommon, and quoit and adroit, both of which are very rare) or oi followed by any voiceless stop consonant (we could add voip and hoick but that’s about it). So we don’t have a lot of other words to compare it to, without going and doing some sociolinguistic research.  (This is a hint to any of my future students who may be reading this post).
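For any student who takes the hint, here is a rough Python sketch of how one might pull comparison words out of a word list. It works on spelling only (final oi plus p, t, k or ck), which is just a crude proxy for the phonological environment, and the sample list is a handful of illustrative words of my own choosing rather than a real corpus.

```python
import re

# Spelling-based stand-in for "oi followed by a word-final voiceless stop".
# Orthography is only a rough guide to pronunciation, so real work would
# need a pronouncing dictionary instead.
OI_VOICELESS_STOP = re.compile(r"oi(?:p|t|ck|k)$", re.IGNORECASE)

def candidates(words):
    """Return the words whose spelling ends in oi + p, t, k or ck."""
    return [w for w in words if OI_VOICELESS_STOP.search(w)]

# Illustrative sample only; swap in a full word list to search properly.
sample = ["Detroit", "exploit", "quoit", "adroit", "voip", "hoick",
          "void", "coin", "noise", "point"]

print(candidates(sample))
```

On this sample the filter keeps Detroit, exploit, quoit, adroit, voip and hoick, and drops the words where oi is followed by a voiced consonant or a nasal.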

I can’t find any publication discussing this phenomenon – I’m not a phonologist and would love to hear from someone who could link this up to the Northern Cities shift more broadly.  I don’t have any explanation for it, but it’s widespread enough that it deserves some attention.   

But wait – we’re not done yet! The reason I wondered about the role of voiceless stops is that while I work in Detroit, I live across the international border in Windsor, Ontario, which is essentially Detroit’s Canadian suburb, and I am a native speaker of Canadian English. Most speakers of Canadian English, including myself, have what’s called the Canadian raising, in which the /aɪ/ diphthong of right and ripe, and the /aʊ/ diphthong of about and house, are raised to /ʌɪ/ and /ʌʊ/ before voiceless consonants and especially voiceless stops – which is why some Americans think that Canadians say aboot or aboat. We don’t, but it might sound that way. And because I, like most Windsorites, have a pretty strong Canadian raising, I pronounce right as /ɹəɪt/ or /ɹʌɪt/, starting with a mid-vowel. Notice that this diphthong is exactly the same as the diphthong in the innovative pronunciation of Detroit. In other words, the ‘oi’ of Detroit for some Detroiters is the same vowel sound as Canadians have in right or trite. They start in completely different diphthongs – /aɪ/ and /ɔɪ/ – but end in the same place.
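For readers who like their sound changes spelled out as rules, here is a toy Python sketch of Canadian raising as described above: /aɪ/ and /aʊ/ become /ʌɪ/ and /ʌʊ/ when the next segment is a voiceless consonant. The segment lists, the inventory of voiceless consonants, and the contrast word ride are my own simplifications for illustration; real speech is a good deal messier than a lookup table.

```python
# Simplified inventory of voiceless consonants (an assumption for this sketch).
VOICELESS = {"p", "t", "k", "f", "s", "θ", "ʃ", "tʃ"}

# The two diphthongs affected and their raised counterparts.
RAISED = {"aɪ": "ʌɪ", "aʊ": "ʌʊ"}

def canadian_raising(segments):
    """Raise /aɪ/ and /aʊ/ when the following segment is voiceless."""
    out = []
    for i, seg in enumerate(segments):
        following = segments[i + 1] if i + 1 < len(segments) else None
        if seg in RAISED and following in VOICELESS:
            out.append(RAISED[seg])
        else:
            out.append(seg)
    return out

print(canadian_raising(["ɹ", "aɪ", "t"]))   # right: raised before /t/
print(canadian_raising(["ɹ", "aɪ", "d"]))   # ride: unchanged before voiced /d/
print(canadian_raising(["h", "aʊ", "s"]))   # house: raised before /s/
```

Note that the rule says nothing about /ɔɪ/; the innovative Detroit vowel arrives at a similar-sounding diphthong by a different route.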

To make things even more complicated, there is a  fourth variant of Detroit used only, as far as I can tell, by older speakers of Canadian English.  This is a three-syllable version,  /dɪˈtɹɔɪ.ɪt/, or di-TROY-it, to rhyme (non-ironically, I promise!) with destroy it.   Most of the users of this variant are Ontario-born native speakers of Canadian English born in the 1950s or earlier.     It can be heard, most famously, in the song, The Wreck of the Edmund Fitzgerald by folk singer Gordon Lightfoot, as heard here.     It is also typical of the Canadian hockey commentator / blowhard / redneck Don Cherry.   I have certainly heard it in Windsor although I suspect it is more common in central and eastern Ontario than here in southwestern Ontario.  To this day, and despite all evidence to the contrary over the five years I’ve worked here, my mother (who is of a similar generation, born and raised east of Toronto) refuses to quite believe me that the two-syllable pronunciation is even acceptable or possible.    I don’t have any explanation for the emergence of this variant either, but it’s obviously been around for many decades.   I’d also love to know whether it’s found widely in any younger Canadian English speakers.

So, in summary, there are four distinct variants of the pronunciation of Detroit, all of which you might hear in the broader Detroit area on any given day:

  • /dɪˈtɹɔɪt/ (di-TROIT), used by locals and most other English speakers
  • /ˈdiˌtɹɔɪt/ (DEE-troit), used by locals sporadically
  • /dɪˈtɹʌɪt/ or /dɪˈtɹəɪt/ (di-TRITE), used by locals who have a strong NCVS
  • /dɪˈtɹɔɪ.ɪt/ (di-TROY-it), used by (some) older Canadians, including some in Windsor.

Glossographic news of the week

Posted by schrisomalis on July 17, 2013

Another busy week of news relevant to readers of this blog:

A few months ago, the Times Literary Supplement reported on the curious case of A.D. Harvey, a serial creator of false personae (all of them, of course, Harvey himself) who collectively formed a self-perpetuating network of literary fraud, until the false story of a meeting between Dickens and Dostoyevsky was exposed and, with it, Harvey. Now, this week, the Guardian interviews Harvey, giving a fascinating glimpse into the sort of person who would spend decades creating false identities and fictitious scholarship.

Some of my readers who are keen on cryptography have probably already seen this article in Wired, discussing the fascinating Kryptos Sculpture and its secret decipherment by the NSA years before its official CIA decipherment. The Kryptos Sculpture is one of those things that, in the absence of context, would clearly cause Phaistos- or Voynich-level excitement in future decipherers.

Not to leave my typography buffs out in the cold, this week the news has been going around about Paul Mathis, the Australian restaurateur who has created a new letter of the alphabet, a logogram ‘Ћ’ for the word ‘the’ to parallel & for ‘and’, at the cost of $38,000.  Alas, I don’t see this one catching on, even though the sign already exists in most character sets as a Cyrillic character.   I just want to know what costs $38,000 to develop an already-existing character.

The always-fascinating Language Log has a post this week about the Potosí miners’ language, a mixed language of Spanish, Quechua and Aymara used from the 16th century to this day by miners in central Bolivia. The survival of this remarkable variety is highly dependent on the continuity of traditional mining practices and a multiethnic speech community.

For those of you who have a PhD or are in a doctoral program, you may want to check out this visualization of the lengths of dissertations at the University of Minnesota.  My field, anthropology, is second-longest (after history) and has the widest range of any discipline, unsurprisingly in a discipline that spans both natural science and humanities.   As for me?  My dissertation checks in at 663 pages and is an extreme outlier in any field.  Woohoo!

Great news in Native American baseball sociolinguistics: the Arizona Diamondbacks hosted the first-ever baseball game broadcast in Navajo (or indeed, any other Native American language), in honour of their Native American Recognition Day.    Now if only we could do something about those pesky mascots elsewhere in the league …

You may have heard that J.K. Rowling, the author of the blockbuster Harry Potter series, was unmasked this week as the author of a crime novel, The Cuckoo’s Calling, published under the name of Robert Galbraith. While some of the evidence leading to the break was ordinary sleuthing, there’s a neat discussion in Ben Zimmer’s column in the Wall Street Journal of the role of forensic stylometry, or the linguistic analysis of texts to ascertain authorship, in confirming and breaking the story, with a complementary essay at Language Log by Patrick Juola, who did the analysis, on the science underlying it.

I was so very pleased to see the first post in nearly a year over at the philology blog, Stæfcræft & Vyākaraṇa. And this one is great – a historical linguistic analysis of the Finnish expletive used by Linux developer Linus Torvalds, with digressions into Indo-European mythology. I hate to disagree with Torvalds, though: there certainly are enough swear words in English, although the Finnish ones sure are fun too!

Stephen Houston and Alexandre Tokovinine write over at the Maya Decipherment blog about some newly-analyzed earspools and a hair ornament bearing Maya glyphs.  It’s a shame to not have any provenance on these, part of the great tragedy that is looting in Mayan archaeology, but fascinating nonetheless to see Maya writing outside of monuments and codices, in a decorative context.

Lastly, Stefan Fatsis, the author of Word Freak and general expert on Scrabble, writes this week in the New York Times about the decision by Hasbro to fold the National Scrabble Association, effectively ending its sponsorship of competitive Scrabble.   While the immediate effect may be slight – Hasbro’s commitment has been waning for several years and the independent NASPA is going strong, as far as I know – it’s sad to see the abdication of responsibility among game manufacturers for the cultures that keep them vibrant.

Conservative skewing in Google N-gram frequencies

Posted by schrisomalis on July 14, 2013

Google Ngram Viewer is a great tool, especially for rough-and-ready searching and visualization of linguistic trends, and as a teaching tool to introduce students to lots of interesting questions we can ask about language variation and patterning.  I use it all the time.  The default search parameters are for 1800 – 2000, and the Culturomics project notes that, “the best data is the data for English between 1800 and 2000. Before 1800, there aren’t enough books to reliably quantify many of the queries that first come to mind; after 2000, the corpus composition undergoes subtle changes around the time of the inception of the Google Books project.” Elsewhere, the Culturomics FAQ  notes that, “Before 2000, most of the books in Google Books come from library holdings. But when the Google Books project started back in 2004, Google started receiving lots of books from publishers. This dramatically affects the composition of the corpus in recent years and is why our paper doesn’t use any data from after 2000.”

OK, so we’ve been warned that the data from before 2000 is very different than the data from after 2000, and especially that 2004 marked a significant change in the corpus. Caveat lector, or whatever you will. But I want to know: In what ways have these ‘subtle changes’ changed the Google N-gram corpus, and therefore, what biases in word frequencies do scholars of language need to account for?

Lately, I’ve had some interest in post-2000 changes in word frequencies for my Lexiculture class project for the fall, and so I’ve been looking at N-gram data going up to 2008 (the last date you can search).  I have found some very weird declines in words that probably aren’t actually declining in relative frequency:

It seems notable that all of these words start to decline shortly after 2000, with a particularly steep decline right around 2004-05. All of these words, I would argue, should be stable or increasing in frequency: these are words associated with modern technology and social life. Conversely, many timeless words (e.g., table, lamp, daughter) are flat or rising after 2000. It’s possible that intuitions about what should be happening to words can be wrong. But why are they all wrong in the same direction, and why do they all decline at the same time?

- One possibility is that the data from 2000 onward aren’t complete yet.  There could be some books published over the past few years that haven’t been integrated into Google Books and thus don’t end up in the Ngram viewer.  But in any case, n-grams measure a word’s frequency relative to all words published in that year, so the fact that the collection isn’t complete should not affect relative word frequencies at all.

- It’s possible that Google Books has systematically missed archiving books oriented towards technology, but why would that be the case?  In fact, if tech-savvy publishers are more likely to submit their works to Google Books (which I think is plausible) than your average publisher, the effect should be to increase these words’ frequency.

- It’s possible that, in the absence of the controlled digitization of books from libraries that characterized the early period of Google Books digitization, and the work done to manage metadata in creating the N-Gram Viewer’s early dataset, massive error has crept into the database. But again, why would this affect particularly modern words negatively, while not affecting words whose frequencies have not changed?

I think I have a better answer.  I think that the N-gram Viewer may be skewed, not because anything significant is being missed, but because something significant is being added.  There is a growing tendency for cheap electronic reprints of public domain books to come out and be immediately included in Google Books, with the publication date listed as the date of its electronic reprinting.    If Levi Leonard Conant’s book The Number Concept (1896) is scanned and reprinted by Echo Books in 2007, the Google Books metadata doesn’t recognize it as an 1896 book at all.    It’s digitized and scanned twice, once (correctly) as an 1896 book and again as a 2007 book. In fact, because it’s in the public domain, I could make my own e-book version for sale as a 2013 book and have it listed again.     And while that’s not likely to have a huge effect, imagine every reprint of A Tale of Two Cities or Wuthering Heights that has flooded the market since the invention of e-books, stimulated by and reinforced by projects like Google Books.

Now, I suppose there is a case to be made that the 2007 reprint of Conant is, in some way, a 2007 book. After all, reprints have never been excluded from Google Books and there are plenty of pre-electronic 20th century reprints of Wuthering Heights in the corpus. But each of those earlier reprints represents a costly decision by a publisher that a particular book is important enough and will be read widely enough to warrant its republication. From a ‘culturomics’ perspective, there’s a case to be made that each of these reprints really constitutes a cultural ‘signal’ in the year of its reprinting, and from a linguistic perspective, we presume that lots of readers will read the words, no matter if they are obsolescent at the time. But as the cost of producing reprints as e-books (or print-on-demand) declines, the ‘culturomic’ value of these books also declines, because publishers no longer need to be concerned about whether many (or even any) people buy these books. The author is long dead, so there are no royalties, and there are no or minimal up-front publishing costs. So Google Books is now being flooded with material that may be largely unread and does not reflect the linguistic or cultural values of the time. Its primary effect, for the N-gram viewer, is to skew relative word frequencies in a way that makes 2013 resemble 1913 more than it actually does. That’s a conservative bias, for those following along at home.

We can then derive a couple of corollaries to check if this theory is correct:

- There are likely to be some words that, while still increasing in frequency, do not increase in frequency quite as much as their actual use should indicate.  These are words that have shot up out of nowhere over the past few years, and are continuing to accelerate, but their N-gram shows a tapering off.  We see a great example of this in a word like transgender, where we see, right around 2004, a clear decline in the acceleration of its frequency, counter to expectations.

- If some word frequencies are artificially depressed, some other word frequencies must be artificially inflated.  But which ones?  There are likely to be other words that were very common in the 19th and early 20th centuries (the period where most of these reprints are going to come from), but have been on the decline for a long time and are now quite rare, that show an apparent ‘rejuvenation’ after 2004.  Again, we find such a word: negro (uncapitalized), which is virtually non-existent in contemporary written English but was at its peak in the period from 1880-1920, and which shows a clear ‘bump’ after 2004 which can’t possibly be real.    You can even see this to a lesser degree with a word like honesty, which (for reasons perhaps best left unanalyzed) had been in decline throughout the 20th century but experiences a bump, again, right around 2004.
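To make the arithmetic behind these two corollaries concrete, here is a minimal sketch in Python. Every number in it is invented for illustration (none of it is Ngram data), and the names relative_frequency, add_reprints, modern_word and obsolescent_word are hypothetical. It assumes only what is described above: that the Ngram Viewer reports a word's count relative to all tokens dated to a given year, and that misdated reprints add tokens to that year's total.

```python
# All numbers below are invented for illustration; they are not Ngram data.

def relative_frequency(word_count, total_tokens):
    """Occurrences of a word per million tokens dated to a given year."""
    return word_count / total_tokens * 1_000_000


def add_reprints(new_counts, new_total, reprint_counts, reprint_total):
    """Merge a year's genuinely new books with reprints dated to that year."""
    words = set(new_counts) | set(reprint_counts)
    merged = {w: new_counts.get(w, 0) + reprint_counts.get(w, 0) for w in words}
    return merged, new_total + reprint_total


# Hypothetical year: 1 billion tokens of genuinely new writing.
new_total = 1_000_000_000
new_counts = {"modern_word": 20_000, "obsolescent_word": 500}

# Hypothetical reprints of century-old books misdated to the same year.
# They never use the modern word but use the obsolescent one heavily.
reprint_total = 250_000_000
reprint_counts = {"modern_word": 0, "obsolescent_word": 5_000}

merged_counts, merged_total = add_reprints(
    new_counts, new_total, reprint_counts, reprint_total
)

for word in ("modern_word", "obsolescent_word"):
    before = relative_frequency(new_counts[word], new_total)
    after = relative_frequency(merged_counts[word], merged_total)
    print(f"{word}: {before:.2f} -> {after:.2f} per million")
```

With these made-up figures, the modern word falls from 20 to 16 per million even though not one fewer copy of it was written, while the obsolescent word climbs from 0.5 to 4.4 per million. The size of the effect depends entirely on how large the reprint share really is, which this sketch does not attempt to estimate.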

In summary, because the Google Books corpus today is derived largely from publisher submissions, and because there is a major signal coming from reprints of public domain books published before 1922, n-grams from 2004 onward (and, to a lesser degree, from 2000-2004) are skewed to make modern words appear more infrequent than they actually are, and obsolescent words more common than they are.    The moral is not that Google is evil or conservative or that culturomics is stupid or that the N-gram Viewer is fatally flawed.   I do think,  nonetheless, that we ought to be aware that the specific kinds of unintentional skewing that are being produced are ones that tend, in a conservative direction, to replicate the linguistic and cultural values of a century ago.  This problem is not going away, absent a systematic effort to eliminate reprints from some future N-gram dataset, and it may even be getting worse as electronic reprints become more and more common.  Stick to the pre-2000 data, though, just like they advise, and you’ll be in good shape.

Thanks to Julia Pope for her consultation and assistance on aspects of Google Books metadata and cataloging practices.



