Document Image Compression
Xerox Palo Alto Research Center

Token-Based Compression

In token-based compression, the pages
of a document are represented as a ``dictionary'' of tokens or symbols
that appear in the document, together with position information
specifying where each token appears. Each token or symbol in the
dictionary is an image of a portion of the document (such as a
character or graphical element). Therefore, multiple occurrences of
the same character in a document are represented using just a
single image of that character. Each page of the document then
specifies which tokens appear on that page and where they are located.
This method achieves compression ratios 3 to 10 times higher than
those of CCITT Group 4 TIFF files.
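
The dictionary-plus-positions representation described above can be sketched roughly as follows. This is only an illustration, not the engine's actual code: the names `TokenDocument`, `compress`, and `render_page` are hypothetical, bitmaps are modelled as tuples of strings, and tokens are matched by exact equality, whereas a real system would need approximate matching to cope with scanning noise.

```python
from dataclasses import dataclass, field

# Hypothetical bitmap model: a tuple of equal-length strings, '#' = ink.
Bitmap = tuple[str, ...]


@dataclass
class TokenDocument:
    # Shared dictionary: one image per distinct token.
    tokens: list[Bitmap] = field(default_factory=list)
    # One list per page of (token_id, x, y) placements.
    pages: list[list[tuple[int, int, int]]] = field(default_factory=list)


def compress(pages):
    """Build a token dictionary from pages of (bitmap, x, y) marks.

    Assumption: marks match only when their bitmaps are identical.
    """
    doc = TokenDocument()
    index: dict[Bitmap, int] = {}
    for marks in pages:
        placements = []
        for bitmap, x, y in marks:
            token_id = index.get(bitmap)
            if token_id is None:
                # First occurrence: store the image once in the dictionary.
                token_id = len(doc.tokens)
                index[bitmap] = token_id
                doc.tokens.append(bitmap)
            # Every occurrence costs only an id and a position.
            placements.append((token_id, x, y))
        doc.pages.append(placements)
    return doc


def render_page(doc, page_no, width, height):
    """Reconstruct a page by stamping each token at its recorded position."""
    canvas = [["."] * width for _ in range(height)]
    for token_id, x, y in doc.pages[page_no]:
        for dy, row in enumerate(doc.tokens[token_id]):
            for dx, pixel in enumerate(row):
                if pixel == "#":
                    canvas[y + dy][x + dx] = "#"
    return ["".join(row) for row in canvas]
```

In this sketch, a page containing the same character three times stores its bitmap once and three small placement records, which is where the savings over per-page bitmap coding come from.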

My Contribution

For two summers, I worked with Dan Huttenlocher
on a prototype of the compression engine. This involved devising
efficient and reliable token matching and clustering algorithms
capable of handling large documents. The prototype was adopted
by Xerox's business divisions for their ScanSoft and DigiPaper
products. For more information on where the project is today, see the
Silx website.