Learning Robust Web Wrappers

FAZZINGA, B.; FLESCA, Sergio; TAGARELLI, Andrea
2005-01-01

Abstract

A major challenge in wrapping web data is to make wrappers robust with respect to variations in HTML sources while reducing human effort as much as possible. In this paper we develop a new approach to speed up the specification of robust wrappers, freeing the wrapper designer from having to define extraction rules in detail. The key idea is to enable a schema-based wrapping system to automatically generalize an original wrapper with respect to a set of example HTML documents. To accomplish this objective, we propose to exploit the notions of extraction rule subsumption and wrapper subsumption to compute a most general wrapper that still shares the extraction schema with the original wrapper while maximizing the generalization of extraction rules with respect to the set of example documents.
2005
3-540-28566-0
information extraction; web wrapping; wrapper maintenance

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/187203