A statistical method for minimum corpus size determination

Caruso A.; Folino A.; Parisi F.
2014-01-01

Abstract

A corpus is a well-sized collection of structured texts (e.g., articles, novels, legal documents, blog posts, oral transcriptions) compiled for specific goals. Interest in the use of corpora is growing, so how they are constituted, from both a qualitative and a quantitative point of view, is crucial. This paper, however, focuses on corpus size. In particular, it presents an introductory study on determining the minimum corpus size, measured as the number of texts. The study relies on a statistical technique commonly used to determine the sample size for a population of unknown size and variance. Measures of lexical richness are embedded in the proposed method to assess quality. The technique is tested on the compilation of a specialist corpus for tourism in Italy. Our numerical results suggest that the proposed statistical technique merits further investigation so that it can be used as a standard decision-support tool for defining corpus size.
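The abstract does not give the formula used, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes the textbook sample-size formula for estimating a mean, n = (z·s / e)², with the standard deviation s estimated from a pilot sample, and it assumes the type-token ratio as the lexical richness measure. The function names, the tolerated margin `margin`, the pilot texts, and the choice of z = 1.96 are all hypothetical.

```python
import math
import statistics


def type_token_ratio(text):
    """Lexical richness as distinct word forms divided by total tokens.

    Assumption: whitespace tokenization and lowercasing; the paper may
    use a different richness measure or tokenizer.
    """
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def min_sample_size(values, margin, z=1.96):
    """Textbook estimate n = (z * s / margin)^2, rounded up.

    `values` are per-text richness scores from a pilot sample, `margin`
    is the tolerated error on their mean, and `z` is the normal quantile
    (1.96 corresponds to roughly 95% confidence).
    """
    if len(values) < 2:
        raise ValueError("need at least two pilot values to estimate variance")
    if margin <= 0:
        raise ValueError("margin must be positive")
    s = statistics.stdev(values)  # sample standard deviation of the pilot
    return math.ceil((z * s / margin) ** 2)


# Hypothetical pilot sample of three short texts.
pilot_texts = [
    "the hotel is near the beach and the old town",
    "guided tours leave every morning from the main square",
    "book a room with a sea view for the summer season",
]
scores = [type_token_ratio(t) for t in pilot_texts]
print(min_sample_size(scores, margin=0.05))
```

In a sketch like this, the estimate would be recomputed as more texts are added to the pilot sample, and collection would stop once the number of texts gathered reaches the estimated minimum.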
2014
978-2-9547781-1-2
corpus size; quantitative approach; text statistics
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/272937