A statistical method for minimum corpus size determination
Caruso A.; Folino A.; Parisi F.
2014-01-01
Abstract
A corpus is a suitably sized collection of structured texts (e.g. articles, novels, legal documents, blog posts, oral transcriptions) compiled for specific goals. Interest in the use of corpora is growing, so how they are constituted, from both a qualitative and a quantitative point of view, is crucial. This paper focuses on corpus size. In particular, it presents an introductory study on determining the minimum corpus size, measured as the number of texts. The study applies a statistical technique commonly used to determine the sample size for a population of unknown size and variance. Measures of lexical richness are embedded in the proposed method to assess quality. The technique is tested in the compilation of a specialist corpus for tourism in Italy. The numerical results suggest that the proposed statistical technique is worth further investigation so that it can be used as a standard decision support tool in corpus size definition.
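The abstract does not state the formula used, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes the textbook sample-size rule for an unknown variance, n = (z·s / E)², where s is the standard deviation estimated from a pilot sample, and it assumes the type-token ratio as the lexical richness measure. The function names, the margin of error, and the pilot values are all hypothetical.

```python
import math
from statistics import NormalDist, stdev


def type_token_ratio(text: str) -> float:
    """Lexical richness as distinct tokens / total tokens (assumed measure)."""
    tokens = text.lower().split()  # naive whitespace tokenisation
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def minimum_sample_size(pilot_values, margin, confidence=0.95):
    """Textbook rule n = (z * s / E)^2, with s estimated from a pilot sample.

    pilot_values: lexical richness scores of a small pilot set of texts
    margin:       tolerated error E on the mean score (hypothetical choice)
    """
    if len(pilot_values) < 2:
        raise ValueError("at least two pilot values are needed to estimate s")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    s = stdev(pilot_values)                         # sample standard deviation
    return math.ceil((z * s / margin) ** 2)


# Hypothetical pilot scores, for illustration only
pilot_scores = [0.4, 0.5, 0.6, 0.5, 0.5]
n_texts = minimum_sample_size(pilot_scores, margin=0.05)
```

In this sketch a smaller margin or a more variable pilot sample raises the required number of texts, which is the behaviour one would expect from any rule of this family.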