Clustering of XML Documents by Structure based on Tree Matching and Merging
TAGARELLI, Andrea
2004-01-01
Abstract
We propose a novel methodology for clustering XML documents by structural similarity. The central idea is the XML cluster representative: an XML document that subsumes the most typical structural features of a set of XML documents. Computing the representative of a cluster involves three phases: tree matching, which builds an initial substructure common to the XML document trees in the cluster; tree merging, which yields a document that also includes the uncommon substructures; and tree pruning, which minimizes the distance between the documents in the cluster and the document built as the cluster representative. We investigate techniques for identifying significant node matchings and for reliably merging and pruning XML trees. An experimental evaluation on both synthetic and real data shows the effectiveness of the approach.
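The three phases can be illustrated with a deliberately simplified sketch. It is not the paper's algorithm: it assumes each XML tree is reduced to its set of root-to-node tag paths, that the distance between two trees is the size of the symmetric difference of their path sets, and that pruning keeps the paths occurring in at least half of the documents. The function names `tag_paths`, `representative`, and `total_distance` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document.

    Assumption: a tree is summarized by its tag paths only; sibling
    order, repetition, attributes and text are ignored.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths


def representative(docs):
    """Return (common, merged, pruned) path sets for a cluster."""
    path_sets = [tag_paths(d) for d in docs]
    # Matching: substructure shared by every document in the cluster.
    common = set.intersection(*path_sets)
    # Merging: add the substructures that are not shared by all.
    merged = set.union(*path_sets)
    # Pruning (assumed rule): keep a path only if it occurs in at least
    # half of the documents. Under the symmetric-difference distance
    # this majority vote minimizes the total distance to the cluster.
    counts = Counter(p for s in path_sets for p in s)
    pruned = {p for p in merged if 2 * counts[p] >= len(path_sets)}
    return common, merged, pruned


def total_distance(candidate, docs):
    """Sum of symmetric-difference distances from candidate to each doc."""
    return sum(len(candidate ^ tag_paths(d)) for d in docs)


if __name__ == "__main__":
    cluster = ["<a><b/><c/></a>", "<a><b/><d/></a>", "<a><b/><c/><e/></a>"]
    common, merged, pruned = representative(cluster)
    for name, rep in (("common", common), ("merged", merged), ("pruned", pruned)):
        print(name, sorted(rep), total_distance(rep, cluster))
```

Because every prefix of a path is at least as frequent as the path itself, the pruned set is still a valid tree under these assumptions. The actual node-matching, merging, and pruning techniques investigated in the paper are more refined than this set-based approximation.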