Bulgarian National Corpus




The Bulgarian National Corpus (BulNC) is a large, representative, publicly available corpus. It is designed as a uniform framework for texts that differ in modality (written and spoken), period, and number of languages (monolingual and parallel).
Its core incorporates several electronic corpora developed between 2001 and 2009, and it has been substantially expanded in the following years. The corpus reflects the state of the Bulgarian language (mainly in its written form) from 1945 until the present.
The enlargement of the BulNC has involved not only the collection of Bulgarian texts, but also the compilation of parallel corpora with Bulgarian as a pivot language. Every text in another language must have a Bulgarian counterpart in the Bulgarian part of the corpus.
Currently, the corpus core consists of over 1.2 billion words in about 240,000 texts. So far, 47 foreign languages have been included, totalling about 4.2 billion words. Thus, the overall size of the corpus exceeds 5.4 billion words.
All texts are supplied with an extensive metadata description compliant with established standards. The corpus is supplied with three levels of annotation:
• A detailed metadata description: each text is supplied with editorial (author's name, text title, source, etc.) and classificatory metadata (general category, domain, genre).
• Monolingual annotation: tokenisation, sentence splitting, POS tagging, lemmatisation, word sense annotation.
• Multilingual annotation: alignment at different levels, currently sentence and clause level.
The tagset used in the annotation of the BulNC is available as the Bulgarian tagset.
The Bulgarian part and the Bulgarian-English parallel corpus are tokenised, sentence-split, POS tagged and lemmatised; the Bulgarian part is also word sense annotated. For the time being, the corpora for the other languages are tokenised, sentence-split and aligned.
A dedicated corpus search system supports complex queries. A set of tools was developed for extracting the metadata and compiling the corpus description from the markup formats. The metadata are as detailed as possible in order to ease text classification, corpus evaluation, the derivation of subcorpora based on a set of criteria (e.g. publishing year, domain), and similar tasks.
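The derivation of subcorpora from metadata criteria can be illustrated with a minimal Python sketch. It is not the BulNC tooling: the TextRecord class, its field names, the derive_subcorpus function and the three sample records are all invented for illustration, loosely following the classificatory metadata listed above (general category, domain, genre) plus a publishing year.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata record; the field names are assumptions."""
    title: str
    year: int       # publishing year
    category: str   # general category
    domain: str
    genre: str


Criterion = Callable[[TextRecord], bool]


def derive_subcorpus(texts: Iterable[TextRecord], *criteria: Criterion) -> List[TextRecord]:
    """Keep only the texts that satisfy every criterion."""
    return [t for t in texts if all(check(t) for check in criteria)]


# Invented sample records, not real corpus entries.
sample = [
    TextRecord("Sample text A", 1987, "fiction", "literature", "novel"),
    TextRecord("Sample text B", 2004, "journalism", "politics", "news"),
    TextRecord("Sample text C", 2011, "non-fiction", "science", "article"),
]

recent_science = derive_subcorpus(
    sample,
    lambda t: t.year >= 2000,
    lambda t: t.domain == "science",
)
print([t.title for t in recent_science])  # ['Sample text C']
```

Expressing each criterion as a separate predicate keeps the filter open to any metadata field, which matches the idea of combining criteria such as publishing year and domain.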
The Bulgarian National Corpus Collocation service (http://dcl.bas.bg/collocations/?cmd=collocations&word=%D0%BD%D0%B5%D1%82) gives access to the Bulgarian National Corpus. The service is built on the free-of-charge NoSketchEngine, a corpus processing system that combines Manatee and Bonito. The Collocation service is a RESTful web service that supports complex queries over HTTP. Example: http://dcl.bas.bg/collocations/?cmd=collocations&word=нет
user: bulnc
pass: bulnc
The query returns the collocations of the given word in the NoSketchEngine format.
The service also accepts all additional arguments that NoSketchEngine accepts; each of them has a default value.
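A query like the example above can be assembled programmatically. The following Python sketch uses only the standard library. The function names collocation_url and fetch_collocations are hypothetical, and the use of HTTP Basic authentication for the user and pass values is an assumption, since the authentication scheme is not specified here. Additional NoSketchEngine arguments are passed through as keyword arguments without being named, because their names are not listed in this description.

```python
from urllib.parse import urlencode
import urllib.request

BASE_URL = "http://dcl.bas.bg/collocations/"


def collocation_url(word: str, **extra: str) -> str:
    """Build the query URL; non-ASCII words are percent-encoded as UTF-8."""
    params = {"cmd": "collocations", "word": word}
    params.update(extra)  # optional NoSketchEngine arguments, if any
    return BASE_URL + "?" + urlencode(params)


def fetch_collocations(word: str, user: str = "bulnc",
                       password: str = "bulnc", **extra: str) -> str:
    """Send the query and return the raw response text.

    Assumption: the service expects HTTP Basic authentication.
    """
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, BASE_URL, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(manager)
    )
    with opener.open(collocation_url(word, **extra), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the URL needs no network access.
print(collocation_url("нет"))
# http://dcl.bas.bg/collocations/?cmd=collocations&word=%D0%BD%D0%B5%D1%82
```

Calling fetch_collocations("нет") would send the request; the response is returned unparsed, since the NoSketchEngine output format is not detailed here.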

