A comparative study on community detection and clustering algorithms for text categorisation

One of the main tasks of Text Mining is organising a large number of unlabelled documents into a smaller set of meaningful and coherent clusters, whose members are similar with respect to their content. Clustering algorithms are usually carried out on documents × terms matrices, algebraically representing each document as a vector. Nevertheless, a collection of documents can also be encoded differently, e.g. by considering a documents × documents representation. This data structure can be seen as an adjacency matrix and displayed as a graph. In the framework of Network Analysis, community detection is performed on such graphs to find groups of nodes that share common characteristics and play similar roles. This paper aims at evaluating the use of different data structures and different grouping criteria, showing the effectiveness of the different alternatives in a text categorisation strategy. We performed a comparative study involving both classical text clustering approaches and community detection approaches, testing and discussing their performance.
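The two data structures contrasted in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy corpus, raw term frequencies as weights, cosine similarity, the 0.3 threshold, and the helper names `term_vectors`, `cosine`, `adjacency` and `connected_components`. Connected components are used only as a crude stand-in for a real community detection algorithm.

```python
from collections import Counter
from math import sqrt

# Toy corpus (hypothetical, for illustration only).
docs = [
    "graph network community nodes",
    "network nodes edges graph",
    "clustering documents terms vectors",
    "documents terms matrix clustering",
]

def term_vectors(texts):
    """Build a documents x terms matrix of raw term frequencies."""
    counts = [Counter(t.split()) for t in texts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def adjacency(rows, threshold=0.3):
    """Build a documents x documents adjacency matrix: 1 when two
    distinct documents are at least `threshold` similar (assumed cut-off)."""
    n = len(rows)
    return [
        [1 if i != j and cosine(rows[i], rows[j]) >= threshold else 0
         for j in range(n)]
        for i in range(n)
    ]

def connected_components(adj):
    """Group nodes by reachability; a simplification, not a
    modularity-based community detection method."""
    seen, groups = set(), []
    for start in range(len(adj)):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for other, linked in enumerate(adj[node]):
                if linked and other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(group))
    return groups

vocab, rows = term_vectors(docs)
groups = connected_components(adjacency(rows))
print(groups)  # [[0, 1], [2, 3]]
```

In this toy case the same grouping would also emerge from clustering the rows of the documents × terms matrix directly. The sketch only shows how one collection yields both representations, and says nothing about which performs better.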

Michelangelo Misuraca;Germana Scepi;Maria Spano

2020-01-01

Year: 2020
Keywords: text clustering, community detection, data representation, weighting schemes, similarity measures
Appears in types: 4.1 Conference proceedings contribution; 5.12 Other

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/304767
