Monday, March 21, 2011

Lunchtime Keynote Makeup


I decided to pass on a formal lunch today to catch the Google Books talk that we were unable to see earlier, finding miscellaneous calories as I could manage.

The mission of Google is to “organize the world's information and make it universally accessible and useful.” The mission of Google Books is thus to “organize the world's books and make them universally accessible and useful.”

Google has now scanned over 15 million books (5 billion pages, 2 trillion words). The majority of books have come from libraries (about 3 million have come directly from publishers). The process goes scan → image process (correct for page curvature) → OCR (originally just for search, but Google purchased reCAPTCHA to try to improve the percentage of properly recognized words) → tag → metadata → rank (not book-to-book references, but circulation and sales data) → index.

He covered a lot of interesting problems Google has, like the wide variety of languages, date inconsistency, page number recognition problems, and mismatched metadata (J.R.R. Tolkien vs. John Ronald Reuel Tolkien).
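
To make the metadata problem concrete, here is a minimal sketch of one way two spellings of an author's name could be reduced to the same key. This is purely my own illustration, not anything shown in the talk: the function name `author_key` and the surname-plus-initials rule are assumptions, and a rule this crude would wrongly merge different authors who share a surname and initials.

```python
import re


def author_key(name):
    """Reduce an author name to (surname, initials of the other names).

    Hypothetical helper for illustration only. It assumes the surname is
    the last word of the name, which is not true for every naming style.
    """
    # Pull out runs of letters, so "J.R.R." becomes ["J", "R", "R"].
    parts = re.findall(r"[^\W\d_]+", name)
    if not parts:
        return ("", "")
    surname = parts[-1].lower()
    initials = "".join(p[0].lower() for p in parts[:-1])
    return (surname, initials)


if __name__ == "__main__":
    for name in ("J.R.R. Tolkien", "John Ronald Reuel Tolkien", "Christopher Tolkien"):
        print(name, author_key(name))
```

Both spellings from the talk collapse to the same key, while a different Tolkien does not. A real catalog would need far more than this (dates, identifiers, transliteration), which is presumably why it counts as a hard problem.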

One striking graph showed the large number of books that are out of print but under copyright.

A huge number of Google Books users are students. There is a big drop in traffic during spring/winter breaks.

No real news on the book settlement, but he provided a brief overview of the current status.

He mentioned the ngram viewer and had some interesting observations on the regularization of verbs in English over time. The example graphs he provided for ngram were good: “United States are” vs. “United States is” and the prevalence of different decade mentions in literature (1980, 1990, 2000, etc.).
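
The comparison behind graphs like these can be sketched in a few lines: for each year, count how often a phrase occurs and divide by the number of phrases of the same length in that year's text. This is only a toy illustration under my own assumptions. The names `ngrams` and `yearly_share` are made up, the two-sentence corpus is invented, and the real viewer works from the scanned books with its own tokenization and smoothing.

```python
from collections import Counter


def ngrams(text, n):
    """Return all n-word phrases in the text, lowercased, split on whitespace."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def yearly_share(texts_by_year, phrase):
    """Map each year to the fraction of its n-grams that equal the phrase."""
    n = len(phrase.split())
    target = " ".join(phrase.lower().split())
    shares = {}
    for year, text in texts_by_year.items():
        grams = ngrams(text, n)
        counts = Counter(grams)
        shares[year] = counts[target] / len(grams) if grams else 0.0
    return shares


# Invented toy corpus, not real data.
toy_corpus = {
    1850: "the united states are a federation and the united states are growing",
    1950: "the united states is a nation and the united states is growing",
}

if __name__ == "__main__":
    for phrase in ("United States are", "United States is"):
        print(phrase, yearly_share(toy_corpus, phrase))
```

Plotting those per-year fractions for the two phrases side by side gives the kind of crossover graph he showed.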

In the Q&A session he mentioned that there is a process for getting missing pages from scans added. Sharing libraries as a feature is coming to Google Books. He mentioned that there is some effort to correct public domain errors (books published by the U.S. Government being shown as under copyright).

The talk was interesting, but not really fantastic. The biggest annoyance was James Crawford's consistent mispronunciation of "tsunami" (tsoo-nah-mee) as "too-sah-mee," which for some reason he thought would be a good example word throughout the talk.
