A Clustering Accuracy Comparison Framework
Clustering is the data mining problem of dividing documents into groups such that documents in one group are more similar to each other than to those in other groups. The aim of this study is to propose a framework for comparing the accuracy of clustering algorithms. The study applies qualitative research through document analysis to review previous comparisons of clustering algorithms and identify the problems with them. We then derive a comparison framework that addresses these problems. The study identified the following comparison issues: nature of comparison, nature of data, size of data, source of data, evaluation metrics, and parameter settings. Consequently, the study proposes the rules, formulae, and procedures to be used in a comparison. Applying this framework should ensure that such evaluations and comparisons follow formal procedures that yield dependable results. Further work is suggested to apply the framework in a comprehensive comparison of selected clustering algorithms.
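As an illustration of two of the issues listed above (evaluation metrics and a like-for-like comparison), the following minimal sketch scores two clusterings of the same labelled documents with one shared metric. The abstract does not state which formulae the framework uses, so purity is chosen here only as an assumed example; the function names `purity` and `compare`, the algorithm names, and the toy labels are all hypothetical.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_ids):
    """Fraction of documents that belong to the majority class of their cluster."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("need at least one document")
    # Count how often each true class occurs inside each cluster.
    members = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster][label] += 1
    # Each cluster contributes the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)


def compare(true_labels, results):
    """Score every algorithm's assignment on the same data with the same metric."""
    return {name: purity(true_labels, ids) for name, ids in results.items()}


# Hypothetical toy data: six documents with known classes.
truth = ["sport", "sport", "sport", "politics", "politics", "science"]
# Hypothetical cluster assignments produced by two algorithms.
results = {
    "algorithm_a": [0, 0, 0, 1, 1, 1],
    "algorithm_b": [0, 0, 1, 1, 2, 2],
}
scores = compare(truth, results)
for name, score in scores.items():
    print(f"{name}: purity = {score:.3f}")
```

Purity tends to increase as the number of clusters grows, so a comparison of this kind would normally hold the number of clusters and other parameter settings fixed across algorithms, or report a second metric alongside it.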