Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.
Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.
Semantic Clustering of XML Documents
TAGARELLI, Andrea;GRECO, Sergio
2010-01-01
Abstract
Dealing with structure and content semantics underlying semistructured documents is challenging for any task of document management and knowledge discovery conceived for such data. In this work we address the novel problem of clustering semantically related XML documents according to their structure and content features. XML features are generated by enriching syntactic with semantic information based on a lexical knowledge base. The backbone of the proposed framework for the semantic clustering of XML documents is a data representation model that exploits the notion of tree tuple to identify semantically cohesive substructures in XML documents and represent them as transactional data. This framework is equipped with two clustering algorithms based on different paradigms, namely centroid-based partitional clustering and frequent-itemset-based hierarchical clustering. An extensive experimental evaluation was conducted on real data sets from various domains, showing the significance of our approach as a solution for the semantic clustering of XML documents.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.