Preserving Written Treasures for the Ages via Digital Makeover
When it comes to books, "the classics" are part of what makes Europe's history so rich in culture. Unfortunately, due to fire, water damage and/or simply the passage of time, many great works are in various states of decay. But IBM and the European Union are now deploying a series of innovative technology and collaboration efforts to ensure these treasures are digitally preserved forever.
The book "Magic: Principles of Higher Knowledge" has maintained a following for centuries because its author, Karl von Eckartshausen, conveyed such a profound sense of spiritual insight within its pages. His clarity helped demystify the issues of a chaotic world, according to one reviewer. "Great Secrets will reveal themselves to you. … All we have to do is ask!" Eckartshausen wrote when the book was published in 1788.
The original copy was damaged by fire and water in 1943, and has remained that way for decades. But thanks to a partnership between IBM and the European Union (EU), Eckartshausen's "magical" book is getting a digital rebirth via a project called IMPACT (IMProving ACcess to Text). The goal of the project is to provide highly accurate digitization of rare and culturally significant historical texts on a massive scale—an effort that involves two dozen national libraries, research institutes, universities and companies throughout Europe.
Unlike past digitization projects that have resulted in static, online libraries of texts, IMPACT will enable participants to efficiently and accurately produce quality digital replicas of historically significant texts and make them widely available, editable and searchable online. Funded by the EU, IMPACT's research combines the power of innovative, Web-enabled adaptive optical character recognition (OCR) software with "crowd computing" technology. Crowd computing will allow groups of volunteers throughout the continent to verify the accuracy of processed texts and correct recognition mistakes through a Web-based system.
The IMPACT system is also capable of "learning" from its recognition errors and adapting automatically to the characters of a specific font. The result is faster digital delivery: Digitizing a small book would take 1 hour using standard OCR technology with manual correction. IMPACT can reduce that time to 15 minutes, a fourfold speedup.
The solution is also expected to decrease error rates by more than one-third. IMPACT streamlines review by not displaying the entire scanned page; reviewers see only the letters or words in question. For example, the letter combination "r" and "n" ("rn") may appear indistinguishable from the letter "m." In such cases, the system collects many samples of the letter "m" and places them next to the letters in question, making it much easier to determine the letter's real identity.
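The side-by-side comparison described above can be pictured with a small Python sketch. It is only an illustration, not IMPACT's actual code: the `Glyph` record, the confidence scores, the 0.95 cutoff and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One recognized character image (all fields are hypothetical)."""
    label: str         # what the OCR engine thinks the image says
    confidence: float  # engine confidence between 0 and 1
    image_id: str      # pointer to the cropped character image


def reference_samples(glyphs, label, min_confidence=0.95, limit=5):
    """Return the most confidently recognized examples of `label`."""
    confident = [g for g in glyphs
                 if g.label == label and g.confidence >= min_confidence]
    confident.sort(key=lambda g: g.confidence, reverse=True)
    return confident[:limit]


def review_panel(glyphs, suspect, limit=5):
    """Pair a doubtful glyph with trusted samples of its assumed label."""
    return {
        "suspect": suspect,
        "references": reference_samples(glyphs, suspect.label, limit=limit),
    }


page = [
    Glyph("m", 0.99, "img-001"),
    Glyph("m", 0.97, "img-002"),
    Glyph("n", 0.98, "img-003"),
    Glyph("m", 0.55, "img-004"),  # could really be "rn"
]
panel = review_panel(page, page[3])
print([g.image_id for g in panel["references"]])  # ['img-001', 'img-002']
```

A reviewer shown `img-004` beside the two trusted "m" samples can judge at a glance whether the shapes match, without reading the whole page.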
In cases where an entire word is suspect, it is added to a collection of other questionable terms, which are then arranged in alphabetical order. Volunteer reviewers need only accept or reject suggested substitutes with one keystroke.
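As a rough sketch of how such a review queue might work, the Python below collects low-confidence words, sorts them alphabetically and applies a reviewer's accept-or-reject decision. The `SuspectWord` record, the 0.8 confidence threshold and the function names are hypothetical and are not taken from the IMPACT software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuspectWord:
    """A recognized word plus the system's proposed substitute (hypothetical)."""
    recognized: str    # the word as the OCR engine read it
    suggestion: str    # the substitute offered to the reviewer
    confidence: float  # engine confidence between 0 and 1


def build_review_queue(words, threshold=0.8):
    """Keep only doubtful words and arrange them in alphabetical order."""
    suspects = [w for w in words if w.confidence < threshold]
    return sorted(suspects, key=lambda w: w.recognized.lower())


def apply_decision(word, accept):
    """One keystroke: accept the suggestion or keep the original reading."""
    return word.suggestion if accept else word.recognized


words = [
    SuspectWord("tbe", "the", 0.40),
    SuspectWord("Magic", "Magic", 0.99),  # confident, so never queued
    SuspectWord("arnd", "and", 0.50),
]
queue = build_review_queue(words)
print([w.recognized for w in queue])          # ['arnd', 'tbe']
print(apply_decision(queue[0], accept=True))  # and
```

Sorting alphabetically places similar misreadings next to each other, which is presumably what lets a volunteer move through the list quickly.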
In addition, the system uses adaptive dictionary enrichment, a method by which new words are added to a central dictionary based on cross-identification and correction by other users.
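The article does not spell out how cross-identification works, so the following Python sketch rests on one loud assumption: a new word joins the shared dictionary once a minimum number of different users have confirmed it. The class name, the `min_confirmations` parameter and the user IDs are all invented for illustration.

```python
from collections import defaultdict


class SharedDictionary:
    """Toy central dictionary enriched by user confirmations (hypothetical)."""

    def __init__(self, known_words, min_confirmations=2):
        self.words = {w.lower() for w in known_words}
        self.min_confirmations = min_confirmations  # assumed promotion rule
        self._pending = defaultdict(set)            # word -> confirming users

    def confirm(self, word, user_id):
        """Record one user's confirmation; return True once the word is accepted."""
        key = word.lower()
        if key in self.words:
            return True
        self._pending[key].add(user_id)  # a set ignores repeat votes by one user
        if len(self._pending[key]) >= self.min_confirmations:
            self.words.add(key)
            del self._pending[key]
            return True
        return False

    def __contains__(self, word):
        return word.lower() in self.words


dictionary = SharedDictionary(["magic", "knowledge"])
print(dictionary.confirm("Geheimnisse", "user-a"))  # False
print(dictionary.confirm("Geheimnisse", "user-b"))  # True
print("geheimnisse" in dictionary)                  # True
```

Requiring agreement from more than one person is one plausible way to keep a single volunteer's typo from polluting the dictionary that every other reviewer relies on.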
"The only way to make a large-scale digitization project work is to dramatically improve the quality of the initial OCR and cut down post-processing tasks as much as possible," says Hildelies Balk, head of European projects at Koninklijke Bibliotheek and the leader for the IMPACT consortium. "With this effort, we're expecting to see remarkable increases in productivity in the digitization process."