Interactive document clustering

via Topic Modeling

by Andras Dörömbözi and Timea Toth

Based on the paper iVisClustering: An Interactive Visual Document Clustering via Topic Modeling, published by Lee et al. in 2012, we created an interactive dashboard for document clustering.

Dashboard

The goal of this project is to group documents with similar characteristics into clusters. Clustering comes with several challenges, such as noisy data and outliers. These make it hard to find a fully automated clustering algorithm: every dataset differs, and a method that works well on one dataset does not necessarily fit new data. The number of online documents has grown rapidly in recent years. To cluster such documents, the authors of the paper developed a visual analysis system, iVisClustering, that performs interactive clustering of document data.

Overview of the implementation

We used the gensim package in Python for topic modeling and extended it with our own functions to make it more suitable for this task. The dashboard was implemented using Dash, which provides the interactivity for the user.
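As an illustration of the topic modeling step, the following is a minimal sketch of how an LDA model could be fitted with gensim and how each document could be assigned to the cluster of its most probable topic. The function names (`fit_lda`, `dominant_topic`, `cluster_documents`) and the parameter values are assumptions made for this example; they are not the project's own extension functions.

```python
from typing import List, Sequence, Tuple


def fit_lda(tokenized_docs: Sequence[Sequence[str]], num_topics: int):
    """Fit a gensim LDA model; return the model and the bag-of-words corpus."""
    # Imported inside the function so that dominant_topic() below
    # can be used even where gensim is not installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    # passes and random_state are illustrative values, not tuned settings.
    model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=10,
        random_state=0,
    )
    return model, corpus


def dominant_topic(topic_probs: Sequence[Tuple[int, float]]) -> int:
    """Return the topic id with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])[0]


def cluster_documents(model, corpus) -> List[int]:
    """Hard-assign every document to its most probable topic."""
    return [
        dominant_topic(model.get_document_topics(bow, minimum_probability=0.0))
        for bow in corpus
    ]
```

For a list of token lists `docs`, calling `model, corpus = fit_lda(docs, num_topics=10)` and then `cluster_documents(model, corpus)` would yield one cluster id per document. Treating the dominant topic as the cluster label is a simplifying assumption of this sketch.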

General Analysis Procedure: After the automatic clustering (k clusters), the data is cleaned. The user can then perform cluster-level interactions such as combining similar clusters and removing clusters. The meaning of each cluster is refined using the LDA inference algorithm. The last step is to fine-tune the clusters, for example by reviewing the documents. With these five steps, the user can maintain meaningful clusters for the data.
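The cluster-level interactions mentioned above (combining and removing clusters) can be sketched on a simple mapping from document id to cluster id. This is only an illustrative sketch: the data structure, the helper names, and the choice to drop the documents of a removed cluster from the mapping are assumptions, not the dashboard's actual implementation.

```python
from collections import Counter
from typing import Dict, Hashable

Assignments = Dict[Hashable, int]  # document id -> cluster id


def merge_clusters(assignments: Assignments, source: int, target: int) -> Assignments:
    """Return a new mapping in which all documents of `source` belong to `target`."""
    return {
        doc: (target if cluster == source else cluster)
        for doc, cluster in assignments.items()
    }


def remove_cluster(assignments: Assignments, cluster: int) -> Assignments:
    """Return a new mapping without the documents of the removed cluster."""
    return {doc: c for doc, c in assignments.items() if c != cluster}


def cluster_sizes(assignments: Assignments) -> Counter:
    """Count how many documents each cluster contains."""
    return Counter(assignments.values())


# Hypothetical example data: five documents in three clusters.
example = {"d1": 0, "d2": 1, "d3": 1, "d4": 2, "d5": 0}
merged = merge_clusters(example, source=2, target=0)  # cluster 2 joins cluster 0
cleaned = remove_cluster(merged, cluster=1)           # cluster 1 is discarded
```

Both helpers return new dictionaries instead of modifying the input, which makes it easier to undo an interaction by keeping the previous mapping.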

Views of the dashboard

Corpus of documents

The corpus is a collection of InfoVis and VAST papers published between 1997 and 2009, containing 454 documents. Download data