Wrapping PDF Documents: A Preliminary Study
FLESCA, Sergio; TAGARELLI, Andrea
2005-01-01
Abstract
This paper investigates the problem of extracting information from PDF documents. Our research is motivated by the widespread adoption of the PDF format for generating print-oriented documents. Unlike Web wrapping, however, PDF wrapping poses new challenges: we frequently have to deal with documents from different, unknown sources that contain the same kind of information but differ substantially in structure and presentation. We propose a solution based on a novel bottom-up wrapping approach that focuses the extraction task on groups of tokens, i.e., the unstructured, spatial features of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions that impose a target structure on the token groups containing the required information. Annotated content models and spatial constraints are also included in the definition of a PDF wrapper. Moreover, we propose a polynomial-time algorithm for extracting token groups.
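To make the notion of token groups more concrete, the following is a minimal sketch of bottom-up grouping under simple spatial constraints. It is not the algorithm proposed in the paper: the `Token` fields, the `max_gap` and `line_tol` thresholds, the regular-expression check standing in for a group type definition, and the sample data are all assumptions introduced purely for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A text token with a hypothetical, simplified position on the page."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline


def group_tokens(tokens, max_gap=15.0, line_tol=2.0):
    """Greedily merge tokens that share a line and are horizontally close.

    Sorting dominates the cost, so this runs in O(n log n) time.
    """
    def line_key(tok):
        # Bucket baselines so that nearly equal y values count as one line.
        return round(tok.y / line_tol)

    groups = []
    for tok in sorted(tokens, key=lambda t: (line_key(t), t.x0)):
        last = groups[-1][-1] if groups else None
        same_line = last is not None and line_key(last) == line_key(tok)
        if same_line and tok.x0 - last.x1 <= max_gap:
            groups[-1].append(tok)   # spatial constraint satisfied: extend group
        else:
            groups.append([tok])     # otherwise start a new group
    return groups


def matches_type(group, pattern):
    """Check a group's text against an assumed content pattern (a regex)."""
    text = " ".join(t.text for t in group)
    return re.fullmatch(pattern, text) is not None


# Hypothetical sample tokens, as a PDF text extractor might report them.
tokens = [
    Token("Total", 300, 330, 100),
    Token("Invoice", 10, 50, 100),
    Token("No.", 55, 70, 100),
    Token("4711", 75, 100, 100),
    Token("Date:", 10, 40, 120),
    Token("2005-01-01", 45, 100, 120),
]

groups = group_tokens(tokens)
group_texts = [" ".join(t.text for t in g) for g in groups]
invoice_groups = [g for g in groups if matches_type(g, r"Invoice No\. \d+")]
```

Here the groups are built purely from token positions, and the content pattern is applied afterwards to select the groups of interest, which loosely mirrors the bottom-up direction described in the abstract.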