Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.

Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.

Semantic Clustering of XML Documents

TAGARELLI, Andrea;GRECO, Sergio
2010-01-01

Abstract

Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.
2010
Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.
XML document clustering; XML structure and content mining; tree tuples; semistructured data and XML
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.11770/129537
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 48
  • ???jsp.display-item.citation.isi??? 33
social impact