Wrapping PDF Documents: A Preliminary Study

This paper investigates the problem of extracting information from PDF documents. Our research is motivated by the widespread adoption of the PDF format for generating print-oriented documents. Unlike Web wrapping, however, PDF wrapping poses new challenges: we frequently have to deal with documents from different, unknown sources that contain the same kind of information but differ substantially in structure and presentation. We propose a solution based on a novel bottom-up wrapping approach that focuses the extraction task on groups of tokens, i.e., the unstructured, spatial features of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. The definition of a PDF wrapper also includes annotated content models and spatial constraints. Finally, we propose a polynomial-time algorithm for extracting token groups.
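The abstract gives no algorithmic detail, so the following is only a minimal sketch of what bottom-up grouping of tokens under a spatial constraint could look like. It is not the authors' algorithm. The `Token` fields, the `max_gap` and `line_tol` thresholds, and the regular-expression "group types" are all hypothetical and introduced purely for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A word with its position on the page (hypothetical representation)."""
    text: str
    x: float      # left edge
    y: float      # vertical position of the line
    width: float


def group_tokens(tokens, max_gap=10.0, line_tol=2.0):
    """Merge tokens into groups, working bottom-up from single tokens.

    Tokens are first clustered into lines (y within line_tol of the first
    token of the line), then neighbours on a line are merged when the
    horizontal gap between them is at most max_gap.
    """
    lines = []
    for tok in sorted(tokens, key=lambda t: t.y):
        if lines and abs(tok.y - lines[-1][0].y) <= line_tol:
            lines[-1].append(tok)
        else:
            lines.append([tok])

    groups = []
    for line in lines:
        line.sort(key=lambda t: t.x)
        current = [line[0]]
        for tok in line[1:]:
            prev = current[-1]
            if tok.x - (prev.x + prev.width) <= max_gap:
                current.append(tok)
            else:
                groups.append(current)
                current = [tok]
        groups.append(current)
    return groups


# Hypothetical "group types": a name plus a content model given as a regex.
GROUP_TYPES = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"\d+\.\d{2} EUR"),
}


def classify(groups):
    """Return (type name, text) for every group matching a group type."""
    results = []
    for group in groups:
        text = " ".join(t.text for t in group)
        for name, pattern in GROUP_TYPES.items():
            if pattern.fullmatch(text):
                results.append((name, text))
    return results


tokens = [
    Token("Total:", 10, 100, 30),
    Token("42.50", 200, 100, 25),
    Token("EUR", 228, 100, 18),
    Token("2005-01-01", 10, 120, 50),
]
print(classify(group_tokens(tokens)))
# [('amount', '42.50 EUR'), ('date', '2005-01-01')]
```

The sketch runs in O(n log n) time because of the sorting steps, which is polynomial, but this says nothing about the complexity or structure of the algorithm proposed in the paper.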

Sergio Flesca; S. Garruzzo; E. Masciari; Andrea Tagarelli

2005-01-01

Year: 2005
ISBN: 88-548-0122-4
Keywords: information extraction; PDF wrapping
Publication type: 4.1 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/166394
