Experiments with HAIRCUT
James Mayfield
Johns Hopkins Applied Physics Laboratory
2:00pm Friday, October 1, 1999
Lecture Hall V, ECS
The Hopkins Automated Information Retriever for
Combing Unstructured Text (HAIRCUT) system is an experimental
Java-based text retrieval engine. HAIRCUT uses both n-grams (fixed-length
character sequences) and words as indexing terms, and both a vector
space model and a hidden Markov model for ranking documents. In
this talk, I will describe how combining these techniques gives
HAIRCUT better retrieval results than using any one of them
individually. I will also report on a sequence of experiments
that compare the efficacy of words and n-grams. Finally, I will
touch on the use of these techniques for cross-language retrieval,
in which a query is posed in English, and documents in one or
more non-English languages are retrieved.
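
To make the distinction between the two kinds of indexing terms concrete, here is a minimal Java sketch. It is an illustration only, not HAIRCUT's code: the class name IndexingTerms, the methods charNGrams and words, the lowercasing and whitespace normalization, and the choice to let n-grams span word boundaries are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class IndexingTerms {

    // Hypothetical helper: overlapping character n-grams of length n.
    // Assumption: text is lowercased, runs of whitespace are collapsed to a
    // single space, and n-grams are allowed to span word boundaries.
    static List<String> charNGrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        String s = text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Hypothetical helper: word terms, split on anything that is not a
    // letter or a digit. Real systems may also apply stemming or stopword
    // removal; none is done here.
    static List<String> words(String text) {
        List<String> terms = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                terms.add(w);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String doc = "Combing unstructured text";
        System.out.println(words(doc));          // word indexing terms
        System.out.println(charNGrams(doc, 5));  // character 5-gram indexing terms
    }
}
```

Under these assumptions, the same text yields a handful of word terms but a much longer list of overlapping n-gram terms. N-grams need no language-specific tokenizer, which is one reason they are often considered for multilingual retrieval.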