Document Image Compression
Xerox Palo Alto Research Center
Token-Based Compression
In token-based compression, the pages of a document are represented as a "dictionary" of tokens (symbols) that appear in the document, together with position information specifying where each token appears. Each token in the dictionary is an image of a portion of the document, such as a character or graphical element. As a result, multiple occurrences of the same character are represented by a single image of that character. Each page then records which tokens appear on it and where they are located. This method achieves compression ratios 3 to 10 times better than those of CCITT Group 4 TIFF files.
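The dictionary-plus-positions idea can be illustrated with a minimal Python sketch. It is not the prototype's code. It assumes each page has already been segmented into small bitmaps with known positions, and it treats two marks as the same token only when their bitmaps are identical; a real engine would need approximate matching and clustering to tolerate scanning noise. The names `Bitmap`, `Glyph`, `compress`, and `decompress` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed toy representation: a bitmap is a tuple of rows made of '#' and '.'.
Bitmap = Tuple[str, ...]
Glyph = Tuple[Bitmap, int, int]    # (bitmap, x, y) for one mark on a page
Placement = Tuple[int, int, int]   # (token_id, x, y)


def compress(pages: List[List[Glyph]]) -> Tuple[List[Bitmap], List[List[Placement]]]:
    """Build one shared token dictionary and per-page placement lists."""
    dictionary: List[Bitmap] = []
    index: Dict[Bitmap, int] = {}   # bitmap -> token id (exact match only)
    encoded_pages: List[List[Placement]] = []
    for page in pages:
        placements: List[Placement] = []
        for bitmap, x, y in page:
            token_id = index.get(bitmap)
            if token_id is None:
                # First occurrence: store the image once in the dictionary.
                token_id = len(dictionary)
                index[bitmap] = token_id
                dictionary.append(bitmap)
            # Every occurrence costs only a token id and a position.
            placements.append((token_id, x, y))
        encoded_pages.append(placements)
    return dictionary, encoded_pages


def decompress(dictionary: List[Bitmap],
               encoded_pages: List[List[Placement]]) -> List[List[Glyph]]:
    """Recover the glyph list of each page from the dictionary and placements."""
    return [[(dictionary[t], x, y) for t, x, y in page] for page in encoded_pages]


# Two pages that reuse the same two shapes.
A = ("#.#", "###", "#.#")
B = ("##.", "###", "##.")
pages = [[(A, 0, 0), (B, 4, 0), (A, 8, 0)], [(B, 0, 0), (A, 4, 0)]]
dictionary, encoded = compress(pages)
```

In this example five marks on two pages are stored as two dictionary images plus five small placement records, which is where the savings come from when a document repeats the same characters many times.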
My Contribution
For two summers, I worked with Dan Huttenlocher on a prototype of the compression engine. This involved devising efficient and reliable token-matching and clustering algorithms that could handle large documents. The prototype was adopted by Xerox's business divisions for their ScanSoft and DigiPaper products. For more information on where the project is today, see the Silx website.