A New Metric for Measuring both Quantitative and Qualitative Similarity for Cluster Analysis

Jamil Alshaqsi, Wenjia Wang

Research output: Contribution to journal › Article

Abstract

This paper presents a new similarity function that measures not only the quantitative resemblance between data instances but also their qualitative characteristics. The new similarity measure is derived from a novel distance metric, which is theoretically proven to satisfy the basic properties of a distance metric. The similarity measure is naturally normalized to [-1, 1], with the magnitude representing the degree of quantitative similarity and the sign indicating the qualitative similarity. Moreover, it can cope with both numerical and categorical data and has the ability to reduce the influence of extreme feature values (e.g. outliers). As an example application, it is embedded into the classic K-means clustering algorithm to evaluate its effectiveness in cluster analysis. It is compared with other commonly used distance or similarity measures, namely the Euclidean distance, Hamming distance, Mutual Information, Manhattan distance, and Chebyshev distance, which are likewise implemented in the classic K-means or its variants, K-modes and K-ANMI. The experimental results on benchmark datasets and the statistical analyses show that, with the same experimental procedure and conditions, the clustering algorithm equipped with the new similarity measure produced significantly better and more consistent results than those using the other distance or similarity measures, which demonstrates that the new similarity measure is capable of capturing more useful information from data.
Original language: English
Journal: IEEE Transactions on Cybernetics
Publication status: Submitted - Mar 2017

Keywords

  • Similarity measure
  • Cluster analysis
