Medical data is often presented as free text in the form of medical reports. Such documents contain important information about patients, disease progression and management, but are difficult to analyse with conventional data mining techniques because of their unstructured nature. Clustering the medical documents into a small number of meaningful clusters may facilitate the discovery of patterns: extracting a number of relevant features from each cluster introduces structure into the data and makes conventional data mining techniques applicable. For this approach to work, it is essential to produce high-quality clusterings. Thus, the main goals of this paper are (1) to experimentally evaluate the performance of six criterion functions in the context of the partitional clustering approach, (2) to compare the clustering results of the agglomerative and partitional approaches for each of the criterion functions using real-world medical documents, and (3) to identify the clustering algorithm best suited to producing high-quality clusterings of real-world medical documents, so that hidden knowledge can be discovered by analysing the produced clusters. Our experimental results show that, for all the criterion functions, the clustering solutions produced by the agglomerative approach are consistently better than those produced by the partitional approach. Moreover, the results show that different criterion functions lead to substantially different results. In addition, we examine the quality of the features produced for each cluster in a classification task that involves discriminating between successful and unsuccessful procedures. The extracted features are used to produce an accurate classification of the data.
Published: 2006
Conference: 2006 International Conference on Data Mining, Las Vegas, United States
Duration: 26 Jun 2006 → 29 Jun 2006