
Enhancing Quality of Text Clustering based on Side Information data

Kajal R. Motwani, B. D. Jitkar

Many text mining applications have side information available alongside the text documents. Such side information includes document provenance, links in the document, user-access behavior from web logs, and other non-textual attributes. These attributes may carry information that is useful for text clustering. Traditional text clustering techniques yield a good number of clusters, but it is hard to say whether the quality of those clusters is as good as expected. Because side information contains meaningful data, incorporating it into the clustering process can substantially improve cluster quality. However, some of this information is noisy, so it must be incorporated carefully, and a principled way of performing the mining process is needed. In this paper, we propose a clustering technique that identifies important side information in the documents using Shannon information gain and a Gaussian distribution, and then uses Bayesian probability to estimate whether adding that side information improves the quality of the clusters. The approach is further extended to generate classification labels, which makes it easier to cluster a large number of documents.
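
To make the role of Shannon information gain concrete, the following is a minimal sketch and not the authors' implementation. It assumes hard cluster assignments and a discrete side attribute (for example, whether a document contains a given link), and it scores the attribute by how much knowing its value reduces the entropy of the cluster labels. The function names `entropy` and `information_gain` and the toy data are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(cluster_labels, attribute_values):
    """Reduction in cluster-label entropy from knowing the attribute value."""
    total = len(cluster_labels)
    # Group the cluster labels by the value of the side attribute.
    groups = {}
    for label, value in zip(cluster_labels, attribute_values):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the cluster labels within each group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(cluster_labels) - conditional


# Toy data (assumed): six documents assigned to two clusters.
clusters = [0, 0, 0, 1, 1, 1]
informative = [1, 1, 1, 0, 0, 0]  # attribute aligned with the clusters
noisy = [1, 0, 1, 0, 1, 0]        # attribute largely unrelated to the clusters

ig_informative = information_gain(clusters, informative)
ig_noisy = information_gain(clusters, noisy)
print(ig_informative, ig_noisy)
```

In this toy setting the aligned attribute receives a much higher score than the noisy one, which is the kind of signal that could be used to decide which side attributes are worth keeping before any Bayesian estimate of their effect on cluster quality.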