But what you may not know is that you also have helped archivists decipher distorted characters in old books and newspapers so that they can be posted on the Web.
You might think that computer scientists would have figured out a way to get computers to decipher those characters. But they haven’t, so instead they’ve figured out a way to harness all that effort you’re making to protect your security. “When you’re reading those squiggly characters, you are doing something that computers cannot,” says Luis von Ahn, a computer scientist at Carnegie Mellon University (C.M.U.) in Pittsburgh.
Von Ahn and colleagues reported last week in the journal Science that Web users have transcribed the equivalent of 160 books a day, amounting to more than 440 million words, in the year since researchers kicked off the program. The initiative is similar to "distributed computing" schemes, which take advantage of unused personal computer processing power to sift through signals received from space for those that might be generated by extraterrestrial intelligence (SETI@home) or to figure out how proteins fold (Folding@home). But the difference with this system is that people, not processors, do the calculations.
“We are getting people to help us digitize books at the same time they are authenticating themselves as humans,” von Ahn says. “Every time people are typing these [answers] out, they are actually taking old books or newspapers and helping to transcribe them.”
Von Ahn's team's method is a twist on the Web site tests known as CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), which have been in use since 2000. The new twist is to use sets of letters from old, weathered books and newspapers that computerized transcribing programs cannot recognize. Much of the raw "fuel" comes courtesy of the Internet Archive project, which transmits words that its optical character recognition (OCR) software cannot make out or that do not appear in the dictionary.
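The mechanism, as described in the team's Science paper, pairs each unsolved word with a control word whose answer is already known: getting the control word right authenticates the user as human, and agreement among several users on the unknown word yields a trusted transcription. A minimal sketch of that voting logic in Python (all names here, such as ReCaptchaSketch and votes_needed, are invented for illustration, not taken from the real system):

```python
import random
from collections import Counter

class ReCaptchaSketch:
    """Toy model of the reCAPTCHA idea: serve one word the OCR already
    solved (the control) alongside one it could not (the unknown). The
    control answer vouches for the human; matching answers on the
    unknown word accumulate until a consensus transcription emerges."""

    def __init__(self, solved_words, unknown_images, votes_needed=3):
        self.solved_words = solved_words  # image id -> known transcription
        self.votes = {img: Counter() for img in unknown_images}
        self.votes_needed = votes_needed
        self.transcribed = {}             # image id -> consensus answer

    def challenge(self):
        """Pick one control word and one still-unsolved word to display."""
        control = random.choice(list(self.solved_words))
        pending = [i for i in self.votes if i not in self.transcribed]
        unknown = random.choice(pending or list(self.votes))
        return control, unknown

    def submit(self, control, control_answer, unknown, unknown_answer):
        """Check the control word; if correct, record a vote for the
        unknown word and promote it once enough users agree."""
        if self.solved_words[control] != control_answer:
            return False  # failed the human test; discard the vote
        self.votes[unknown][unknown_answer] += 1
        best, count = self.votes[unknown].most_common(1)[0]
        if count >= self.votes_needed:
            self.transcribed[unknown] = best
        return True
```

In this sketch a user who botches the control word contributes nothing, while independent agreement from a few users turns a scrap of unreadable newsprint into a transcription the archive can trust.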
About 40,000 Web sites now use the service, called reCAPTCHA, which the project's site offers for free. Facebook was one of its first major patrons.
Von Ahn estimates that at reCAPTCHA's current rate of transcription (about four million words a day missed by OCR systems), the program accomplishes in a single day a week's worth of work by 1,500 professional transcribers. The data are stored on hard drives at C.M.U. and then sent back to the organization that requested the transcription. (The New York Times, for example, has enlisted reCAPTCHA to digitize the newspaper's archives dating back to 1851.)
When the researchers compared how reCAPTCHA and OCR transcribed five Times articles, reCAPTCHA did a significantly better job—99.1 percent accuracy—than OCR of the sort that Google uses for its book project, which came in at 83.5 percent. (Google declined to comment for this story.)
But as is the way with most technology, today’s innovation is tomorrow’s VHS tape. Eventually computers will be able to decipher reCAPTCHAs, too. “We’ll get a few good years out of reCAPTCHAs,” says co-author Manuel Blum, a professor of computer science at Carnegie Mellon and key developer of some of the first CAPTCHAs.
OCR will continue to improve as well, Blum says, along with so-called machine learning in general.
Either way, with some 100 million books published prior to the dawn of the digital era, says von Ahn, that “makes for a lot of words.”