An effective solution to automate information extraction from Web pages is represented by wrappers. A wrapper associates a Web page with an XML document that represents part of the information in that page in a machine-readable format. Most existing wrapping approaches have traditionally focused on how to generate extraction rules, while they have ignored potential benefits deriving from the use of the schema of the information being extracted in the wrapper evaluation. In this paper, we investigate how the schema of extracted information can be effectively used in both the design and evaluation of a Web wrapper. We define a clean declarative semantics for schema-based wrappers by introducing the notion of (preferred) extraction model, which is essential to compute a valid XML document containing the information extracted from a Web page. We developed the SCRAP (SChema-based wRAPper for web data) system for the proposed schema-based wrapping approach, which also provides visual support tools to the wrapper designer. Moreover, we present a wrapper generalization framework to profitably speed up the design of schemabased wrappers. Experimental evaluation has shown that SCRAP wrappers are not only able to successfully extract the required data, but also they are robust to changes that may occur in the source Web pages.

Schema-based Web wrapping

Fazzinga B;FLESCA, Sergio;TAGARELLI, Andrea
2011

Abstract

An effective solution to automate information extraction from Web pages is represented by wrappers. A wrapper associates a Web page with an XML document that represents part of the information in that page in a machine-readable format. Most existing wrapping approaches have traditionally focused on how to generate extraction rules, while they have ignored potential benefits deriving from the use of the schema of the information being extracted in the wrapper evaluation. In this paper, we investigate how the schema of extracted information can be effectively used in both the design and evaluation of a Web wrapper. We define a clean declarative semantics for schema-based wrappers by introducing the notion of (preferred) extraction model, which is essential to compute a valid XML document containing the information extracted from a Web page. We developed the SCRAP (SChema-based wRAPper for web data) system for the proposed schema-based wrapping approach, which also provides visual support tools to the wrapper designer. Moreover, we present a wrapper generalization framework to profitably speed up the design of schemabased wrappers. Experimental evaluation has shown that SCRAP wrappers are not only able to successfully extract the required data, but also they are robust to changes that may occur in the source Web pages.
semistructured data and XML; information extraction; web wrapping
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/20.500.11770/157759
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 11
  • ???jsp.display-item.citation.isi??? 7
social impact