The HiLeX System for Semantic Information Extraction

Manna, Marco; Alviano, Mario; Leone, Nicola
2012-01-01

Abstract

The explosive growth and popularity of the Web have resulted in a huge number of digital information sources on the Internet. Unfortunately, such sources only manage data, rather than the knowledge they carry. Recognizing, extracting, and structuring relevant information according to its semantics is a crucial task. Several approaches in the field of Information Extraction (IE) have been proposed to support the translation of semi-structured or unstructured documents into structured data or knowledge. Most of them have high precision but, since they are mainly syntactic, they often have low recall, depend on the document format, and ignore the semantics of the information they extract. In this paper, we describe a new approach to semantic information extraction that could serve as the basis for automatically extracting highly structured data from unstructured web sources without an undesirable trade-off between precision and recall. In short, the approach (i) is ontology driven, (ii) is based on a unified representation of documents, (iii) integrates existing IE techniques, (iv) implements semantic regular expressions, (v) has been implemented through Answer Set Programming, (vi) is employed in real-world applications, and (vii) has received positive feedback from business customers.
2012
Answer set programming, Business customers, Digital information, Document formats, Explosive growth, High precision, Information extraction, Precision and recall, Real-world application, Regular expressions, Semantic information, Semi-structured, Structur

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/134964

Citations
  • Scopus 15