Book Scanning

Books are really important to me. I grew up collecting all kinds, with the highest interest in the most-obscure ones. I often read non-fiction, and I'm a picky reader. For years, I wondered about how best to interact with the medium. Through those years, I tried different vehicles, ranging from "writing in the margins" through "highlighting on a Kobo", and I must now say: most ebooks suck. They are rubbish, and lack a deeper vision for this medium. Here are some points I'd like to make, in bullet-pseudo-question form:

In this document, I'll try to answer some of these questions through the outline of my process. It is possible to have a better-than-paper experience with no restrictions. There are many options for scanning books, and almost all of them are expensive, bulky, and inaccessible. The only one that cuts through that (pun intended) is by using a guillotine to cut the spine. This turns the pages into loose-leaf, and the pages may be loaded into a document feed scanner (ADF).

Hardware

I do my scanning with a Fujitsu fi-4120c, which is a USB 1.1 600 DPI duplex feed scanner. It had perfect support with SANE, the most-popular free scanning software, so I purchased one from eBay for about $30. I use this scanner directly with FreeBSD. One thing I do regularly is to clean the sensor glass, which sometimes gets gummy from glue left on cover pages. I've also replaced the two rubber pads that separate pages; these were available on eBay for $5.

Software

The scanner plugs into my computer with USB. From my terminal, I run some commands to scan various kinds of material. For paperback books, I start with the front cover, in-color, at 300 DPI. The rest of the pages are usually monotone. Occasionally, where there are pictures I will use grayscale or color.

I'll describe processing further in a later section, but it basically by stabilizing the pages. Text is cut out, identified for location/features, and re-paged according to precise margins. Optional masks of gray/color areas are identified. OCR is applied. Pages are re-aggregated into a single .pdf, and finally metadata is added such as publishing details and chapter markers. A .djvu derives from the .pdf.

Feeding the Machine

The guillotine I use is located at my nearest UPS store. Because I had a lot of books, and this store was a franchise where the owner was working, we struck a deal on the price (he didn't have a SKU for the service). I get even further discounts when I pre-measure my books. They set the width on their guillotine, the book is inserted, and the spine is chopped off. Measure well, because sometimes there is residual glue stuck between the pages, obstructing page separation. It doesn't hurt to cut more off than you think, without getting into ink, because the text may later be re-paged with precise margins.

Chopped books

Before loading the pages into the scanner, I fan them out a few times to be sure there are no stuck pages. Then, I fan them vertically (so it forms a knife-shape into the scanner) with my leading page most prominent. Using this method I never get paper jams.

I wrote this script for scanning. Use it like: w=130 h=210 t=128 scan.sh front color 300. For paperbacks, I typically test the threshold value against text and two-tone images, and if it looks good in the sample I follow through with this process:

  1. Front cover, Color, 300 DPI
  2. Back of front cover, BW, 600 DPI
  3. Ordinary pages, BW, 600 DPI
  4. Multi-tone pages, Gray, 600 DPI
  5. Graphics, Color 300 DPI
  6. Front of back cover, BW, 600 DPI
  7. Back cover, Color, 300 DPI

The result is a directory containing only PNG images. At this point, I make an archive of the pages so I have a backup if something goes wrong during processing.

Image Processing

Using a very simple script, I re-page (crop and flush the border). It's just a little pnmtools mixed in with imagemagick. If I ever find the time, I intend to write a more-nuanced program that will statistically average the text borders, look at line positions, and actually re-page the text so the lines of paragraphs are colinear between pages, giving a nice experience. For now, I'm cheap and time's s'pensive. Below are before and after images of this process.

Before and after repaging

Metadata

Yet to be written

OCR

It is rather trivial to convert these images to PDF, with a text overlay, using tesseract. As usual, sprinkle in some parallel, if you can muster.

\ls *.png | parallel tesseract {} ../crop/{/.} --dpi 600 -l eng pdf

I always do OCR as my last step before uniting the individual pages into a single PDF because I don't want the text to get messed up.

Bookmarking and Extras

I've been meaning to get around to automatic chapter detection, but for now I use a good old spreadsheet to record the chapter placements. First, I measure the distance between Page 1 and the file index (e.g. p. 1 is 26.png so the difference is 25). Then, add that offset to the page numbers in the Table of Contents. So, if Chapter 3 starts on Page 43 in the ToC, that means the bookmark should appear on Page 68 (43+25).

Eventually, you'll get to write a text document that looks like this. In that example, I have also added a Title, Subject, Author, ISBN number, and bookmarks at different indentation levels.

Page sizing

Output and Reading Experience

Yet to be written

Here is a sample of a textbook scanned using a modified process, whereby grayscale and color areas are masked, seperated into individually-compressed layers, and recomposited. This gives a great level of detail with a medium file size, which works well for converting into very-compressible .djvu files.

Comparision of file formats

Devices

Appendix A: Topology Mapping with Microsoft Kinect

3D image of book from KinectBefore I decided to cut the spines off my books, I investigated using a Kinect to make a 3D point-cloud of the book's pages, then use that to dewarp a high-resolution image. I did some searching, but only a couple people played around with the idea...eight years ago! After investigating this myself, I determined it was a fool's path.

First, there's a problem with the software. It is difficult to source the software that communicates and controls a Microsoft Xbox 360 Kinect, and there is another version (my kind) that was made for PC, with a different model number. I was able to overcome this limitation by using Robot Operating System (ROS).

Second, even though I was now able to obtain the point cloud data, there was a problem: it was extremely noisy! To add, the depth resolution of the Kinect wasn't that great, either. At this point, I thought it was a neat idea, but didn't have the patience to continue pursuing it. There are better ideas, like shining a LASER line at an angle on the page, which could give a simpler, better measure. I ended up chopping my books, because I no longer saw it as destructive; done right, it was a transformation that made them more-valuable, not less.

Shown above, is a point cloud in MeshLab. At left is my rolling chair, center the book, and upper-right is a crumpled tissue.

Go home