A feisty embuggerance

When I grade my students’ paper proposals, I make a point of doing a brief Google Scholar search for each proposal, which (a) helps me evaluate how thorough the student has been and (b) helps me help them find additional material (I then give them the sources I found, but also the keywords I used to find them). One of my students in my introductory linguistic anthropology course this term is doing a paper on linguistic aspects of laughter and humor. During my search, I encountered the following citation (direct from Google Scholar to you):

Embuggerance, E., and H. Feisty. 2008. The linguistics of laughter. English Today 1, no. 04: 47-47.

After I stopped laughing, I set to figuring out what was going on.

1) I quickly discarded the theory that an unlikely duo of scholars actually had this pair of names – although that would have been too awesome for words. In fact, no other article listed in Google Scholar has an author named ‘Embuggerance’ (although there are a couple other Feistys).

2) I also considered the possibility that this was one of the many metadata errors in Google Scholar; for instance, there are thousands of articles whose purported authors are named Citations or Introduction or Methods, due to errors where it interprets headings like “IV. Methods” as a name “Dr. I.V. Methods”. But this seemed unlikely in the extreme in this case.

3) This left the possibility that these were pseudonyms adopted by particularly amusing authors as part of a parody article.

In this case the article is in fact a book review (which I could tell because it’s all on one page), so I didn’t recommend it to the student, but I did request it for my own edification. Lo and behold, it arrived today as a PDF.

‘The linguistics of laughter’ is a book review of The Language of Humour by Walter Nash. It’s perfectly ordinary and non-satirical, and it does not contain the words Embuggerance or Feisty. But next to it is another book review, entitled ‘Concise and human’, which contains the following passage (emphasis added):

Silverlight’s concise and human reports cover a surprising range of curious items, from Acid Rain through Bottom Line, Catch 22, Dinner/Supper, Embuggerance, Escalate, Feisty, Holistic, Krasis, Ms, Naff, Quorate, Shambles and Viable to Yomping.

The four bolded words (Embuggerance, Escalate, Feisty and Holistic) appear on a single line, and the fact that the Google Scholar metadata gives the ‘authors’ as Dr. E. Embuggerance and Dr. H. Feisty, with the initials presumably taken from Escalate and Holistic, seals the deal. This is the source, and so something like option 2 above is correct. But this is really weird. Not only do the pseudo-authors appear in the middle of a contextualized sentence (not in headings), but the sentence is in the wrong review – a review that itself is found (mostly correctly) in Google Scholar!
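For the technically curious, here is a minimal sketch of how a naive author-extraction rule could produce exactly this citation. It is purely illustrative: I have no idea how Google Scholar actually parses author names, the function name naive_authors is made up, and the sketch assumes that the line in question reads ‘Embuggerance, Escalate, Feisty, Holistic,’. The rule simply treats a comma-separated run of capitalized words as alternating ‘Surname, Given name’ pairs and abbreviates the given name to an initial.

```python
import re


def naive_authors(line: str) -> list[str]:
    """Hypothetical heuristic, not Google Scholar's actual method.

    Treats every capitalized word on the line as part of an author list
    written as 'Surname, Given, Surname, Given, ...' and reduces each
    given name to its initial.
    """
    # Capitalized words only: one uppercase letter followed by lowercase letters.
    words = re.findall(r"\b[A-Z][a-z]+\b", line)
    authors = []
    # Pair words up: 1st with 2nd, 3rd with 4th, and so on.
    for surname, given in zip(words[0::2], words[1::2]):
        authors.append(f"{surname}, {given[0]}.")
    return authors


# Assumed contents of the single line from the 'Concise and human' review.
line = "Embuggerance, Escalate, Feisty, Holistic,"
print(naive_authors(line))  # ['Embuggerance, E.', 'Feisty, H.']
```

Under that assumption, Escalate and Holistic are swallowed as given names, which would account for the initials E. and H. in the citation.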

To make matters even worse, at the end of the reviews section the phrase ‘Reviews by Tom McArthur’ appears – an attribution which is found in the metadata for ‘Concise and human’ but not for ‘The linguistics of laughter’. And, as if this were not bad enough, even though both reviews are listed as being from 2008, the PDF clearly shows them as being from 1985. If I were a gambling man, I’d wager that 2008 is the year when the metadata was added and/or the file was scanned.

Now, mostly this is just a humorous anecdote; I don’t mean this as an indictment of Google Scholar, which I consider to be the most useful way for most scholars to find academic literature, and which I use virtually every day. But one has to wonder at the process (automated or otherwise) that leads to this comedy of errors. A great deal of virtual ink has been spilled over at Language Log (here and here, for instance) on the metadata problems with Google Books / Google Scholar and their implications for linguistic research, for tenure cases that rest on faulty citation records, and other potential problems. Until there is a way for these sorts of errors to be corrected by end users, we may all be well and truly embuggered.

21 Comments

  1. I don’t mean this as an indictment of Google Scholar, which I consider to be the most useful way for most scholars to find academic literature

    Given the errors in Google Scholar and the fact that the articles indexed in it are often not full-text, I’m surprised that you’d think so highly of it. It sounds like you’re a professor; unless your college or university is extremely underfunded, the school library should have access to high-quality databases which contain records indexed by information professionals rather than unqualified hirelings or, worse, computers. A much higher percentage of hits in those databases will be full-text, too.

    • Laughingrat, I am indeed a professor at a major research university. But the only database other than Google Scholar in my field that does full-text searching is JSTOR; everything else just lets me search title, keywords, abstract. I found all that stuff years ago using the subscription databases.

      What’s more, Google Scholar results not only give me the ‘hit’, but also usually a snippet of text where my search keywords are found. This allows me to rapidly exclude material that I would otherwise have to read / request if I were using a traditional index.

      Google Scholar has major metadata problems; everyone knows that, including the folks at Google. But if I want to search thousands of books, journals, conference proceedings, etc. all at once, in their full text, for a reference to a person, place, or thing, there literally is no adequate subscription-only database that will allow me to do that.

      Given this fact, it behooves us as scholars to make the case to Google that they must now work to clean up these problems – or it behooves us to make the case to subscription services that they aren’t doing what needs to be done to be maximally useful.

  2. Pingback: Jay Lake: [links] Link salad wishes the Child a Happy Birthday!

  3. Pingback: uberVU - social comments

  4. I can only second your analysis of Google Scholar in your reply to the first comment. As a librarian, it is heresy to say this, but in a few short years Google Scholar has improved sufficiently and now pretty much whomps up on a lot of the subscription databases.

    Many in libraries and academia are keen to point out all of the warts in Google Scholar, but are less keen to be so critical of the databases for which they pay. That the MLA Bibliography, for example, is years behind in indexing scores of journals, and has incredibly poor coverage in many non-English languages (despite the International boast in its name) is a little known or explored fact in libraries. Other fee-based databases evince similar flaws (Library Literature, ironically, is one of the worst), but it isn’t nearly as much fun to pick on them as it is to shellac Google.

    Were I a betting man, would I put money on Google to improve Scholar, or on the subscription databases to overcome their woes? Well, that is an easy call.

    • Thanks Dale! My “favourite” ridiculousness from a subscription database is the ongoing fact that the ISI Web of Science translates all titles into English, without providing the original untranslated title, making it quite a chore to actually find accurate citation info. Whoever thought that would be a good idea?

      Google Scholar got significantly more useful for me once it started indexing JSTOR, at which point I made the switch once and for all. Full-text searching is so useful for the sorts of work that I do that I can’t imagine relying heavily on a non-full-text database ever again. In twenty years our kids will be incredulous that keyword searches covered just titles and abstracts. Well, okay, maybe our kids won’t care, but I will!

      • That full-text searching is what sets it apart. I agree. In a talk a few years ago for a group of German literature scholars, I took a hypothetical but realistic topic and compared what one could find via the MLA and Google Scholar. The narrow focus of the MLA meant that it was good at finding what one already knew about, namely, articles in literary journals. What Google unearthed was a host of relevant articles from other disciplines (as well as from journals that the MLA either doesn’t index or hadn’t gotten to yet). The audience, otherwise highly skeptical of the G word, was suitably impressed.

        What it comes down to is machine indexing vs. human indexing. I cannot get the image of John Henry out of my mind when I think about this matchup. I think the human indexers can only win by extreme effort, and we all know what happened to poor John.

        BTW, I don’t think it will take 20 years for your scenario to occur. I think it has already happened in some quarters.

  5. Pingback: The Volokh Conspiracy » Blog Archive » Profs. Embuggerance & Feisty

  6. I’m bemused by all of this, and here’s why. My library (hclib.org) offers a search tool which they style as Power Search (360Search seems to be behind this, which claims to be an OpenURL link resolver), which combines searches across available databases and includes the option of full text searches. Is this not commonly available??

    • Most university libraries have some sort of metasearch / power search that will search across many existing databases. But these can only search what those databases provide individually, which is normally just title, author, abstract, and keywords, in terms of things one is likely to search for. By ‘full text’ I mean that it searches the entire text of the article for some string, and in that case, the only two big databases I know that do so are JSTOR and Google Scholar – and Google Scholar now indexes everything that is in JSTOR too. Google Scholar also covers all scholarly books found in Google Books, which is highly valuable for scholars in the humanities and social sciences, among others.

      • tsuwm – The flaw that Stephen pointed out is not the only problem with federated searching (the generic term for the tool you mentioned). These tools rely on protocols such as Z39.50 or XML gateways for searching, and vendors are notoriously dodgy in providing robust access to their data via one of those methods. Why should they? Metasearching means the user may never even see their interface, in which they invest millions in development.

        Such access methods are also slow. I have worked with various fed search technologies, and the work that goes into making it all work is insane given how poor the output is. Besides, slow = irrelevant these days, when we expect Web responses in under one or two seconds or go elsewhere.

        Last, but not least – even if you can search across a bunch of databases, that does not make the flaws inherent in them go away. Besides the limited data (what Stephen pointed out), there are things like indexing delays (many are still produced by human indexers), skipped journals, odd scopes, etc. Plus, nearly all library databases offered in the US skew very heavily toward the Anglo-American world, which is silly given how much research occurs elsewhere and how much of it is reported in English these days. Google Scholar does not have most of the flaws I just outlined.

  7. Pingback: a feisty embuggerance of metadata! « Across Divided Networks

  8. Pingback: Lorna’s JISC CETIS blog » When automatic metadata generation goes bad…

  9. Pingback: Chairperson and English lexiculture « Glossographia

  10. Pingback: Philip Roth on the Philip Roth tour, Google Scholar slips, and more
