In this paper, we describe a framework for clustering documents according to their mixtures of topics. The proposed framework combines the expressiveness of generative models for document representation with a properly chosen information-theoretic distance measure to group the documents via an agglomerative hierarchical clustering scheme. The clustering solution obtained at each level of the dendrogram reflects an organization of the documents into sets of topics, while being produced without the effort needed for a soft/fuzzy clustering method. Experimental results obtained on large, real-world collections of documents evidence the effectiveness of our approach in detecting non-overlapping clusters that contain documents sharing similar mixtures of topics.

Topic-Based Hard Clustering of Documents Using Generative Models

TAGARELLI, Andrea
2009-01-01

Abstract

In this paper, we describe a framework for clustering documents according to their mixtures of topics. The proposed framework combines the expressiveness of generative models for document representation with a properly chosen information-theoretic distance measure to group the documents via an agglomerative hierarchical clustering scheme. The clustering solution obtained at each level of the dendrogram reflects an organization of the documents into sets of topics, while being produced without the effort needed for a soft/fuzzy clustering method. Experimental results obtained on large, real-world collections of documents evidence the effectiveness of our approach in detecting non-overlapping clusters that contain documents sharing similar mixtures of topics.
978-3-642-04124-2
document clustering; topic modeling; information-theoretic clustering
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.11770/163636
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 4
  • ???jsp.display-item.citation.isi??? 2
social impact