Distance Measures for Clustering Timeseries

Konstantinos Kalpakis, Dhiral Gada, and Vasundhara Puttagunta, "Distance Measures for Effective Clustering of ARIMA Time-Series". In the Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM'01), San Jose, CA, November 29-December 2, 2001, pp. 273-280.

Abstract

Many environmental and socioeconomic time-series can be adequately modeled using Auto-Regressive Integrated Moving Average (ARIMA) models. We call such time-series ARIMA time-series. We consider the problem of clustering ARIMA time-series. We propose using the Linear Predictive Coding (LPC) cepstrum of a time-series for this purpose, taking the Euclidean distance between the LPC cepstra of two time-series as their dissimilarity measure. We demonstrate that LPC cepstral coefficients have the desired features for accurate clustering and efficient indexing of ARIMA time-series. For example, only a few LPC cepstral coefficients are sufficient to discriminate between time-series that are modeled by different ARIMA models. In fact, this approach requires fewer coefficients than traditional approaches such as DFT and DWT. The proposed distance measure can also be used to measure the similarity between different ARIMA models.
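As an illustration of the proposed dissimilarity measure, the following is a minimal sketch and not the authors' implementation. It assumes each time-series has already been fitted with an AR(p) model written as x_t + a_1 x_{t-1} + ... + a_p x_{t-p} = e_t (this sign convention is an assumption), and it derives the cepstral coefficients from the AR coefficients with the standard LPC recursion. The function names `lpc_cepstrum` and `cepstral_distance` and the default of five coefficients are hypothetical choices made for this example.

```python
import math


def lpc_cepstrum(alpha, n_coeffs):
    """Return the first n_coeffs LPC cepstral coefficients of an AR(p) model.

    alpha holds a_1..a_p for the model
    x_t + a_1*x_{t-1} + ... + a_p*x_{t-p} = e_t  (assumed sign convention).
    """
    p = len(alpha)
    c = []
    for n in range(1, n_coeffs + 1):
        # The direct term -a_n only exists while n <= p.
        acc = -alpha[n - 1] if n <= p else 0.0
        # Recursive term: sum over m = 1 .. min(n - 1, p).
        for m in range(1, min(n - 1, p) + 1):
            acc -= (1.0 - m / n) * alpha[m - 1] * c[n - m - 1]
        c.append(acc)
    return c


def cepstral_distance(alpha_a, alpha_b, n_coeffs=5):
    """Euclidean distance between the LPC cepstra of two AR models."""
    ca = lpc_cepstrum(alpha_a, n_coeffs)
    cb = lpc_cepstrum(alpha_b, n_coeffs)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


if __name__ == "__main__":
    # Two hypothetical AR(1) models with opposite-sign coefficients.
    print(lpc_cepstrum([0.5], 3))
    print(cepstral_distance([0.5], [-0.5], n_coeffs=3))
```

For an AR(1) model the recursion reduces to c_n = (-a_1)^n / n, which gives a quick sanity check. Because the distance is computed on a short fixed-length vector, it can be dropped into any clustering or indexing method that expects Euclidean distances.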

We cluster ARIMA time-series using the Partition Around Medoids method with various similarity measures. We present experimental results demonstrating that the proposed measure yields significantly better clusterings of ARIMA time-series data than other traditional similarity measures, such as DFT, DWT, and PCA. Experiments were performed on both simulated and real data.
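To make the clustering step concrete, here is a simplified sketch in the spirit of Partition Around Medoids, operating on a precomputed dissimilarity matrix. It is not the exact procedure used in the paper: the initialization (the first k points) and the first-improvement swap rule are assumptions made for brevity, and the name `pam` and the toy data are hypothetical. In practice the matrix would hold pairwise distances between time-series under whichever similarity measure is being evaluated.

```python
def pam(dist, k, max_iter=100):
    """Simplified k-medoids on a symmetric dissimilarity matrix.

    dist[i][j] is the dissimilarity between items i and j.
    Returns (medoid indices, cluster label per item).
    """
    n = len(dist)
    medoids = list(range(k))  # naive initialization (assumption)

    def cost(meds):
        # Total dissimilarity of every item to its nearest medoid.
        return sum(min(dist[i][m] for m in meds) for i in range(n))

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # Try swapping the medoid at `pos` with a non-medoid.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                c = cost(trial)
                if c < best - 1e-12:
                    medoids, best, improved = trial, c, True
        if not improved:
            break

    labels = [min(range(k), key=lambda j: dist[i][medoids[j]])
              for i in range(n)]
    return medoids, labels


if __name__ == "__main__":
    # Toy 1-D data with two well-separated groups.
    points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
    dist = [[abs(a - b) for b in points] for a in points]
    print(pam(dist, k=2))
```

Working from a dissimilarity matrix rather than raw coordinates keeps the clustering code independent of the measure, so the same routine can be rerun with different measures for comparison.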

Keywords: timeseries, similarity measures, clustering, ARIMA models, cepstral coefficients.

Real Timeseries Datasets used in the paper:

Accompanying Technical Report TR-CS-01-04