Wrapping PDF Documents: A Preliminary Study

FLESCA, Sergio; TAGARELLI, Andrea
2005-01-01

Abstract

In this paper we investigate the problem of extracting information from PDF documents. Our work is motivated by the widespread adoption of the PDF format for generating print-oriented documents. Unlike Web wrapping, however, PDF wrapping poses new challenges: we frequently have to deal with documents from different, unknown sources that contain the same kind of information but differ substantially in structure and presentation. We propose a solution based on a novel bottom-up wrapping approach that focuses the extraction task on groups of tokens, i.e. the unstructured, spatial features of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Annotated content models and spatial constraints are also part of the definition of a PDF wrapper. Moreover, we propose an algorithm for extracting token groups that runs in polynomial time.
Year: 2005
ISBN: 88-548-0122-4
Keywords: information extraction; PDF wrapping
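
The abstract describes the approach only at a high level. As an illustration of the general idea, and not the authors' algorithm, the following minimal Python sketch groups positioned tokens bottom-up using a simple spatial constraint and then matches each group against a "group type" expressed as a list of regular expressions. The Token class, the functions group_tokens, match_group_type and extract, and the thresholds max_gap and tol are all hypothetical names and values chosen for this sketch.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A text token with its bounding box on the page (hypothetical model)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def group_tokens(tokens, max_gap=15.0, tol=2.0):
    """Group tokens bottom-up: first into lines, then into runs of neighbours.

    Assumed spatial constraint: two tokens belong to the same group when they
    lie on the same line (top edges within `tol`) and the horizontal gap
    between them is at most `max_gap`. Sorting dominates, so this step is
    O(n log n) in the number of tokens.
    """
    lines = []
    for tok in sorted(tokens, key=lambda t: t.y0):
        if lines and abs(lines[-1][0].y0 - tok.y0) <= tol:
            lines[-1].append(tok)
        else:
            lines.append([tok])

    groups = []
    for line in lines:
        line.sort(key=lambda t: t.x0)
        current = [line[0]]
        for tok in line[1:]:
            if tok.x0 - current[-1].x1 <= max_gap:
                current.append(tok)
            else:
                groups.append(current)
                current = [tok]
        groups.append(current)
    return groups


def match_group_type(group, patterns):
    """Return the token texts if the group matches the type, else None.

    A group type is modelled here as one regular expression per token,
    a much simpler stand-in for the content models named in the abstract.
    """
    if len(group) != len(patterns):
        return None
    if all(re.fullmatch(p, t.text) for p, t in zip(patterns, group)):
        return [t.text for t in group]
    return None


def extract(tokens, patterns):
    """Collect every token group that matches the given group type."""
    matches = (match_group_type(g, patterns) for g in group_tokens(tokens))
    return [m for m in matches if m is not None]


# Illustrative tokens, as they might come out of a PDF text extractor.
tokens = [
    Token("Invoice", 50, 20, 90, 30),
    Token("Total:", 50, 100, 80, 110),
    Token("42.00", 85, 100, 115, 110),
    Token("Page", 400, 100, 425, 110),
]
TOTAL_TYPE = [r"Total:", r"\d+\.\d{2}"]  # hypothetical group type
groups = group_tokens(tokens)
extracted = extract(tokens, TOTAL_TYPE)
```

Because the grouping relies only on token positions and content patterns, the same group type can in principle be applied to documents whose overall layout differs, which is the situation the abstract highlights.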

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/166394