A curiosity about the F-word in Google Ngram Viewer

by Jakub Marian

Google Ngram Viewer is a tool you can use to plot how common a word or a phrase was through the years in literature. I use it a lot to learn about historical usages of various words and idioms, and I noticed a certain oddity today. If you enter the English vulgarism “fuck” in the Ngram Viewer and hit “search”, you will see the following graph:

[Image: Ngram Viewer graph for “fuck”]

What was going through our ancestors’ minds? Was there a period of fierce sexual activity that ended in the early 1800s?

The first thought that springs to mind is that “fuck” meant something else at the time, but this is not the case. A quick search in an etymology dictionary reveals that the “earliest appearance of current spelling is 1535”, and there doesn’t seem to be any alternative meaning of it.

Optical character recognition

It turns out our ancestors weren’t as perverted as we may imagine (or at least we cannot prove that based on the graph above). What we see is in fact a big mistake in the database the software uses.

Gutenberg may have revolutionized the publishing of printed books, but electronic versions of his books were terrible (perhaps because electricity hadn’t been invented yet), so Google has to scan old books page by page and then run a bunch of different OCR (optical character recognition) programs to convert the pictures to text. And that’s the sticking point.

Take a look at the following text:

[Image: excerpt from a book published in 1645, typeset with the long s]

Did you notice something strange? It is an excerpt from a book published in 1645. It looks very much like the text on this website (it is quite extraordinary how little has changed during the last 400 years), with one major exception: the S.

Until the early 19th century, it was common to typeset the letter “s” at the beginning and in the middle of a word using a symbol that looks like “f” without the middle line (called long s). For example, the title page of the first edition of Milton’s Paradise Lost reads:

[Image: title page of the first edition of Milton’s Paradise Lost]

It’s no wonder OCR software interprets the character as “f”. What comes up as “fuck” in the Ngram Viewer before it became widespread in modern language is in fact “suck” (as is clear from the excerpt above, for instance).
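
To make the mechanism concrete, here is a minimal Python sketch of what probably happens. It assumes the simplified rule described above (every lowercase “s” except a word-final one is typeset as a long s), which ignores the finer points of historical typesetting, and it models the OCR software as a hypothetical engine that always reads the long s as “f”. The function names are made up for this illustration; this is not how Google’s pipeline is actually implemented.

```python
LONG_S = "\u017f"  # the long s character (LATIN SMALL LETTER LONG S)


def to_long_s(word: str) -> str:
    """Typeset a word the old way under a simplified rule:
    every lowercase 's' except a word-final one becomes a long s."""
    if not word:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", LONG_S) + last


def naive_ocr(text: str) -> str:
    """Hypothetical OCR engine that mistakes every long s for 'f'."""
    return text.replace(LONG_S, "f")


if __name__ == "__main__":
    for word in ["suck", "lost", "paradise", "books"]:
        typeset = to_long_s(word)
        print(word, "->", typeset, "->", naive_ocr(typeset))
```

Under these assumptions, “suck” comes out of the scanner as “fuck” and “lost” as “loft”, while “books” survives intact because its only “s” is at the end of the word.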

The Oxford Companion to the Book states about the long s that it “rarely appears in good quality London printing after 1800, though it lingers provincially till 1824”, which is completely in accordance with the graph plotted at the beginning of the article (“fuck” disappears around 1825). The moral of the story is that the Ngram Viewer cannot be trusted for the years before 1825 when it comes to words containing an “s” (or, as with “fuck”, an “f” that may really be a misread long s).
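
If you want to check your own Ngram queries against this pitfall, something like the following Python sketch could help. It is only a rough heuristic built on my own assumptions: the cutoff year 1825 is taken from the graph above, any non-final “f” is treated as a possible misread long s, and the helper names `long_s_candidates` and `is_suspect` are invented for this example (they are not part of any Google tool).

```python
from itertools import product

CUTOFF_YEAR = 1825  # assumption: long s is essentially gone from print after this


def long_s_candidates(word: str) -> set[str]:
    """Return the spellings that a long-s-blind OCR could have turned into
    `word`: every non-final 'f' might originally have been an 's'."""
    if not word:
        return set()
    body, last = word[:-1], word[-1]
    options = [("f", "s") if ch == "f" else (ch,) for ch in body]
    candidates = {"".join(combo) + last for combo in product(*options)}
    candidates.discard(word)  # the word itself is not an alternative reading
    return candidates


def is_suspect(word: str, year: int) -> bool:
    """Heuristic: before the cutoff, counts are unreliable if the word has a
    non-final 's' (undercounted) or a non-final 'f' (possibly inflated)."""
    if year >= CUTOFF_YEAR:
        return False
    return "s" in word[:-1] or bool(long_s_candidates(word))


if __name__ == "__main__":
    print(long_s_candidates("fuck"))
    print(is_suspect("fuck", 1700), is_suspect("fuck", 1900))
```

For a flagged word, it is probably worth plotting the candidate spellings alongside the original query and comparing the curves before drawing any conclusions.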
