Multiresolution Text Analysis

Published in the Workshop on New Paradigms in Information Visualization and Manipulation at the CIKM '96 Conference.

Abstract

The n-gram analysis technique breaks a text document into its unique n-character grams and produces a vector whose components are the counts of these grams. A typical corpus contains hundreds of thousands of such grams. Wavelet compression reduces the dimension of the n-gram vectors and speeds up document query operations. Document vectors reduced to four components are readily represented in a three-dimensional volume.
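The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not name the wavelet family or the windowing scheme, so the sketch assumes an overlapping sliding window for the grams and an unnormalized Haar wavelet, keeping only the approximation (pairwise average) coefficients. The names ngram_counts, to_vector, and haar_reduce are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n=3):
    """Count each n-character gram in text (assumed: overlapping sliding window)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def to_vector(counts, vocabulary):
    """Lay the gram counts out in a fixed vocabulary order."""
    return [float(counts.get(gram, 0)) for gram in vocabulary]


def haar_reduce(vector, size=4):
    """Shrink a vector to at most `size` components (size: a power of two).

    Assumed scheme: zero-pad to a power-of-two length, then repeatedly
    replace adjacent pairs by their average (unnormalized Haar
    approximation coefficients), discarding the detail coefficients.
    """
    v = list(vector)
    length = 1
    while length < len(v):
        length *= 2
    v += [0.0] * (length - len(v))
    while len(v) > size:
        v = [(v[i] + v[i + 1]) / 2.0 for i in range(0, len(v), 2)]
    return v


# Usage: two toy documents reduced to four components each.
docs = ["wavelet compression", "wavelet transform"]
counts = [ngram_counts(d) for d in docs]
vocabulary = sorted(set().union(*counts))
reduced = [haar_reduce(to_vector(c, vocabulary), 4) for c in counts]
```

With every document reduced to the same four components, a query only has to compare four numbers per document instead of one per gram in the corpus, which is where the speed-up mentioned above would come from under these assumptions.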

Amen Zwa (zwa@cs.umbc.edu)
Last modified: 21 November 1996