Clustering of XML Documents by Structure based on Tree Matching and Merging

TAGARELLI, Andrea
2004-01-01

Abstract

We propose a novel methodology for clustering XML documents based on their structural similarities. The basic idea is to exploit the notion of an XML cluster representative: an XML document that subsumes the most typical structural specifics of a set of XML documents. Computing the XML representative of a cluster consists of three phases: tree matching, which constructs an initial substructure common to the XML document trees in the cluster; tree merging, which yields a document that also includes the uncommon substructures; and tree pruning, which minimizes the distance between the documents in the cluster and the document built as the cluster representative. We investigate suitable techniques for identifying significant node matchings and for reliably merging and pruning XML trees. An experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
Year: 2004
ISBN: 88-901409-1-7
Keywords: semistructured data and XML; XML mining; document clustering
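
The three phases named in the abstract can be illustrated with a deliberately simplified sketch. This record does not describe the paper's actual matching, merging, or pruning techniques, so everything below is an assumption made for illustration: each document is reduced to its set of root-to-node tag paths, the structural distance is taken to be the Jaccard distance between path sets, pruning is a greedy leaf removal, and the function names (`tag_paths`, `distance`, `representative`) are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths


def distance(a, b):
    """Jaccard distance between two path sets (an assumed stand-in distance)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def total_distance(rep, docs):
    """Sum of distances between a candidate representative and the cluster."""
    return sum(distance(rep, d) for d in docs)


def representative(docs):
    """Build a representative path set for a non-empty list of path sets."""
    # Phase 1, matching: the substructure shared by every document.
    common = set.intersection(*docs)
    # Phase 2, merging: extend it with the substructures that are not shared.
    rep = set.union(*docs)
    # Phase 3, pruning: greedily drop uncommon leaf paths while doing so
    # lowers the total distance to the documents in the cluster.
    improved = True
    while improved:
        improved = False
        for path in sorted(rep - common, key=lambda p: (-len(p), p)):
            # Skip inner nodes: removing them would disconnect their subtree.
            if any(p != path and p[:len(path)] == path for p in rep):
                continue
            candidate = rep - {path}
            if total_distance(candidate, docs) < total_distance(rep, docs):
                rep = candidate
                improved = True
    return rep
```

Called on the path sets of a cluster, for example `representative([tag_paths(x) for x in xml_strings])`, the function returns the path set of the pruned representative. The greedy pruning only reaches a local minimum of the total distance, which is enough to show the role of the third phase but is not a claim about the method evaluated in the paper.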

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/179165