CIKM 2005 Tutorial

Clustering Large and High-Dimensional Data


Charles Nicholas, Department of Computer Science and Electrical Engineering, UMBC

Jacob Kogan, Department of Mathematics and Statistics, UMBC

Marc Teboulle, School of Mathematical Sciences, Tel-Aviv University


This page is under occasional construction. Comments and corrections are welcome!

The current version of the tutorial: Nicholas (pdf), Kogan (pdf), Teboulle (pdf). A two-up handout form of Nicholas (pdf).

From the Tutorial:

Alan F. Smeaton, Gary Keogh, Cathal Gurrin, Kieran McDonald and Tom Sødring, "Analysis of papers from twenty-five years of SIGIR conferences: what have we been doing for the last quarter of a century?", ACM SIGIR Forum, Volume 36, Issue 2, Fall 2002, pages 39-43. (pdf)

C. J. van Rijsbergen, Information Retrieval, Second Edition, page 103.

E. Rasmussen, "Clustering Algorithms", in Information Retrieval: Data Structures and Algorithms, William Frakes and Ricardo Baeza-Yates, editors, Prentice Hall, 1992.

The whisky dendrogram is from
The clustering software used is available from Clustan
(The presenters have no commercial interest in Clustan or any other software vendor mentioned in this tutorial.)

Octave code for single link clustering, complete link clustering, and a comparison of the two. Octave is required to run the demonstration. (The code could be converted to MATLAB with some effort.)
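
For readers without Octave, the difference between the two linkage rules can be sketched in a few lines of Python. This is not the Octave code referred to above; it is a minimal illustration written under stated assumptions. The function name `agglomerate`, its parameters, and the sample points are all hypothetical, and the naive search over cluster pairs is only suitable for tiny inputs.

```python
from itertools import combinations


def agglomerate(points, k, linkage="single"):
    """Naive bottom-up clustering: merge the two closest clusters until k remain.

    Hypothetical helper for illustration only. `points` is a list of
    equal-length tuples; `linkage` is "single" or "complete".
    """
    def dist(a, b):
        # Euclidean distance between two points
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Single link measures two clusters by their closest pair of points,
    # complete link by their farthest pair.
    pick = min if linkage == "single" else max

    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: pick(
                dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Sample data (made up): points on a line with steadily growing gaps.
pts = [(0,), (10,), (25,), (45,), (69,)]
single = agglomerate(pts, 2, "single")
complete = agglomerate(pts, 2, "complete")
print(single)    # [[(0,), (10,), (25,), (45,)], [(69,)]]
print(complete)  # [[(0,), (10,)], [(25,), (45,), (69,)]]
```

On this sample, single link chains the first four points together one neighbor at a time and leaves the last point alone, while complete link, which penalizes wide clusters, splits the line into two more compact groups.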

The meta-search engine, featuring clustering of search results. See also For more information on search engines that cluster their results, see and

The Reuters collection is available at




Star Clustering:

Parallel and/or Distributed Clustering:

High Dimensionality:

Papers by the Presenters:

Some Clustering Software:

The clustering software used in the whisky study is available from Clustan

CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.

Frank Dellaert's clustering software for MATLAB.