The BREF corpus was designed to provide enough read speech data for the development and evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a large corpus of continuous speech for the acquisition of acoustic-phonetic knowledge of spoken French. All the recorded texts were selected from extracts of the French newspaper Le Monde so as to provide a large vocabulary (over 20,000 words) and a wide range of phonetic environments. The entire BREF corpus contains over 100 hours of speech material from 120 speakers.
The BREF-80 sub-corpus consists of two ISO 9660 CD-ROMs, BREF80-1 and BREF80-2, containing speaker-independent training data from 80 speakers. Together the two CDs contain 5330 sentences, an average of about 67 sentences per speaker. Although this data represents only a small portion of the entire BREF corpus, the sentences were selected to cover most of the BREF training prompts, so as to preserve a wide range of phonetic contexts with a minimum amount of speech data. The BREF-80 sub-corpus on these CDs was thus specifically selected for training speaker-independent, vocabulary-independent speech recognizers.

Produced at LIMSI-CNRS, these two CDs contain read articles taken from the daily newspaper "Le Monde". The texts were selected to maximize the number of phonetic contexts; the vocabulary exceeds 20,000 words. The corpus comprises 5330 sentences produced by 80 speakers (about 67 sentences per speaker on average). It is a subset of the BREF corpus, which totals 100 hours of recordings.

