Sunday, January 25, 2009

Recovering text from old books

My house is filled with old books on the Vedas, Sanskrit, and many epics. Many of them were printed at least a century ago, and in Grantham at that. This script was used in ancient times to write Sanskrit; Devanagari is just another script for Sanskrit, as Grantham is. Even today, a few letters used in Tamil have their roots in Grantham, e.g. கஷ. People call these letters Vada Mozhi (வட மொழி, literally "northern language", i.e. Sanskrit). A lot of these letters and their variants are still found in Malayalam, and to some extent in Telugu and Kannada as well.

These books are generally out of print, and the script doesn't have an official computer font. I came to know that a professor at the Indian Institute of Technology, Madras, has developed a font for this script, but keeps it strictly in private circulation.

Anyway, I don't have the time or resources to re-type hundreds of books into text format. What I can do is scan a few of the ones that my father and his friends use and print them again (at home), so that the books themselves will be preserved.

While I was eyeing this exercise, I came across one of my colleagues, who was on a similar mission. Apparently, her dad had picked up a book from a historical library and wanted to do something similar: recover the text from an out-of-print book into a readable format.

We had the following tools with us:
1. A digital camera (SLR)
2. A flatbed scanner from HP
3. Adobe Photoshop (a friend has it)
4. GIMP (I am a Linux fan)
5. Original material (books - hundreds of them)

When I first tried to recover the pages of a book, it was difficult and frustrating. Finally I arrived at a "formula" that works on many pages that are reasonably smooth and uniform. Below are the steps that I followed (in GIMP on Kubuntu Linux):

1. Original Scanned Page (part)

2. Increased the contrast in GIMP

3. Burnt the shadows using the BURN tool

4. In the Levels dialog, I set the WHITE point using the filler tool (GIMP -> Colour -> Levels)

5. DODGE the highlights

6. Burn the shadows again

7. Convert to INDEXED mode with GREY Scale index

8. Convert back to RGB Mode, Set White point in levels and Burn the Shadows again

9. Remove unnecessary spots using the eraser. Anti-alias for better viewing, plus some fine tuning
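
For anyone who would rather script the "set the white point" part than click through it, here is a minimal sketch in plain Python. It is not what GIMP does internally, just the simplest linear version of a Levels adjustment as I understand it. The function name `apply_levels` and the tiny sample "page" are made up for illustration; a real scan would be loaded with an imaging library.

```python
def apply_levels(pixels, black_point, white_point):
    """Linearly stretch grey values so black_point -> 0 and white_point -> 255.

    `pixels` is a list of rows of grey values (0-255). Values outside the
    chosen range are clipped, which is what pushes yellowed paper to white.
    """
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    scale = 255.0 / (white_point - black_point)
    result = []
    for row in pixels:
        new_row = []
        for value in row:
            stretched = round((value - black_point) * scale)
            new_row.append(max(0, min(255, stretched)))
        result.append(new_row)
    return result


# Hypothetical sample: yellowed paper around 200, faded ink between 60 and 90.
page = [
    [200, 198, 60, 201],
    [199, 90, 75, 202],
]

# Treat 198 as "paper white" and 60 as "ink black" (assumed values).
cleaned = apply_levels(page, black_point=60, white_point=198)
for row in cleaned:
    print(row)
```

With these sample values the paper pixels all end up at 255 while the ink stays dark, which is roughly the effect of steps 4 and 8 above. The burn and dodge steps are local, hand-applied corrections and are not covered by this sketch.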


I was able to recover some portion of this particular "EASIER" page. Other pages are in pathetic shape. People have used the books for many years. The brittle pages have given way, and about a decade back people applied cellophane tape over them to keep them from falling apart. Some pages have even been laminated to "preserve" them, after so many decades. The result? The pages look RED/BROWN. When I try to increase the contrast, I can't differentiate between the text and the page. The background has started blending into the text (or vice versa).
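
One thing that might be worth trying on the red/brown pages, before touching the contrast, is to look at the colour channels separately. Reddish-brown paper is bright in the red channel, while dark ink is dark in every channel, so the red channel alone may separate text from page better than a plain greyscale conversion. The sketch below only illustrates that idea with invented pixel values (`PAPER` and `INK` are assumptions, not measurements from my scans). In GIMP the equivalent should be the Decompose command in the Colours menu.

```python
def extract_channel(rgb_pixels, index):
    """Return one channel (0 = red, 1 = green, 2 = blue) as a grey image."""
    return [[pixel[index] for pixel in row] for row in rgb_pixels]


def average_grey(rgb_pixels):
    """Plain greyscale: the integer average of the three channels."""
    return [[sum(pixel) // 3 for pixel in row] for row in rgb_pixels]


def spread(grey):
    """Difference between the brightest and darkest value in the image."""
    values = [value for row in grey for value in row]
    return max(values) - min(values)


# Invented sample colours: brownish paper and faded, slightly brown ink.
PAPER = (150, 90, 60)
INK = (60, 40, 35)

page = [
    [PAPER, INK, PAPER],
    [PAPER, PAPER, INK],
]

red_spread = spread(extract_channel(page, 0))
grey_spread = spread(average_grey(page))
blue_spread = spread(extract_channel(page, 2))

# A larger spread means more room to separate ink from paper.
print("red:", red_spread, "grey:", grey_spread, "blue:", blue_spread)
```

For these made-up colours the red channel gives the widest gap between ink and paper and the blue channel the narrowest. Whether real laminated pages behave the same way depends on the ink and the discolouration, so treat this as an experiment rather than a fix.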

Does anyone have experience fixing such images? (I will post a sample of a bad page soon.)

