Abstract
Clustering time series is a problem that has applications in a wide variety of fields, and has recently attracted a large amount of research. In this paper we focus on clustering data derived from Autoregressive Moving Average (ARMA) models using k-means and k-medoids algorithms with the Euclidean distance between estimated model parameters. We justify our choice of clustering technique and distance metric by reproducing results obtained in related research. Our research aim is to assess the effects of discretising data into binary sequences of above and below the median, a process known as clipping, on the clustering of time series. It is known that the fitted AR parameters of clipped data tend asymptotically to the parameters for unclipped data. We exploit this result to demonstrate that for long series the clustering accuracy when using clipped data from the class of ARMA models is not significantly different to that achieved with unclipped data. Next we show that if the data contain outliers then using clipped data produces significantly better clusterings. We then demonstrate that clipped series require much less memory and that operations such as distance calculations can be much faster. Finally, we demonstrate these advantages on three real-world data sets.
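As an illustration of the clipping step and the parameter-based distance described in the abstract, the following is a minimal sketch and not the authors' implementation. The function names `clip`, `fit_ar` and `param_distance`, the least-squares AR fit, and the use of NumPy are all assumptions made for this example; the paper may estimate the model parameters differently.

```python
import numpy as np

def clip(series):
    """Clip a series: 1 where a value is above the series median, else 0."""
    x = np.asarray(series, dtype=float)
    return (x > np.median(x)).astype(np.uint8)

def fit_ar(series, order):
    """Least-squares fit of AR(order) coefficients (an assumed estimator)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Column k holds the series lagged by k + 1 steps.
    lags = np.column_stack([x[order - k - 1 : n - k - 1] for k in range(order)])
    coeffs, *_ = np.linalg.lstsq(lags, x[order:], rcond=None)
    return coeffs

def param_distance(a, b):
    """Euclidean distance between two fitted parameter vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# A single extreme value does not change the clipped sequence.
clean = clip([1, 2, 3, 4, 5])
outlier = clip([1, 2, 3, 4, 500])

# Clipped values can be stored one bit each: 64 points fit in 8 bytes.
packed = np.packbits(clip(np.arange(64)))
```

Here `clean` and `outlier` are both `[0, 0, 0, 1, 1]`, which hints at why clipping is insensitive to outliers, and `packed` occupies 8 bytes where the same 64 points stored as 64-bit floats would occupy 512. The vectors returned by `fit_ar` for clipped or unclipped series could then be compared with `param_distance` inside a k-means or k-medoids routine.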
Original language | English |
---|---|
Pages | 49-58 |
Number of pages | 10 |
Publication status | Published - Aug 2004 |
Event | 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Seattle, United States. Duration: 22 Aug 2004 → 25 Aug 2004 |
Conference
Conference | 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |
---|---|
Abbreviated title | KDD '04 |
Country/Territory | United States |
City | Seattle |
Period | 22/08/04 → 25/08/04 |