Bulgarian Sense-Annotated Corpus

BulSemCor

http://dcl.bas.bg/semcor/en/

ID: 804

The Bulgarian Sense-Annotated Corpus (BulSemCor) contains sense-disambiguated lexical items, each defined in its context of occurrence.

The Bulgarian Sense-Annotated Corpus follows the methodology of the Princeton University SemCor. Like SemCor, it consists of excerpts from a Brown corpus, in this case the Brown Corpus of Bulgarian. Each lexical item (simple word, compound word or multiword expression) is manually assigned a unique semantic or grammatical meaning from the Bulgarian wordnet (BulNet), according to its particular context.

Unlike other sense-annotated corpora, BulSemCor covers both open-class and closed-class words, as well as all occurrences of multiword expressions and named entities.

The annotated lexical units inherit all the information from the synonym sets (synsets) in BulNet, including the explanatory definition, PoS, usage examples, notes on grammatical, stylistic and pragmatic properties, and all relations (semantic, morpho-syntactic and extra-linguistic) pertaining to the synset, as well as the semantic and derivational relations pertaining to the literal. BulSemCor contains 101 062 tokens and 99 480 annotated lexical units, including 86 842 single words and 5 797 multiword expressions.

BulSemCor is used as a training and test set in the development of a probability-based automatic word-sense disambiguation method that is applicable in a variety of natural language processing tasks, such as machine translation, text categorisation and information extraction.
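As a rough illustration of how a sense-annotated corpus can feed a probability-based disambiguator, the sketch below estimates P(sense | lemma) by relative frequency and picks the most probable sense. It is a generic most-frequent-sense baseline, not the method developed with BulSemCor. The data, the sense identifiers and the function names are hypothetical and do not reflect the corpus format or real BulNet synset IDs.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (lemma, sense_id) pairs standing in for annotated
# lexical units. The sense identifiers are invented for illustration and are
# not real BulNet synset IDs.
train = [
    ("bank", "bank#1"),
    ("bank", "bank#1"),
    ("bank", "bank#2"),
    ("key", "key#1"),
]


def estimate_sense_probabilities(pairs):
    """Estimate P(sense | lemma) by relative frequency in the training pairs."""
    counts = defaultdict(Counter)
    for lemma, sense in pairs:
        counts[lemma][sense] += 1
    return {
        lemma: {s: c / sum(senses.values()) for s, c in senses.items()}
        for lemma, senses in counts.items()
    }


def most_probable_sense(model, lemma):
    """Return the most probable sense for a lemma, or None if it was never seen."""
    senses = model.get(lemma)
    if not senses:
        return None
    return max(senses, key=senses.get)


model = estimate_sense_probabilities(train)
print(most_probable_sense(model, "bank"))
```

A baseline of this kind ignores context entirely. A practical system would condition the probabilities on surrounding words and hold out part of the annotated data for testing.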

Distribution

Availability

Available - Restricted Use

Start date: 11/30/2010

Licence

Other

Restrictions: Academic - Non Commercial Use

Execution location: hidden

Distribution Access/Medium: Accessible Through Interface

IPR Holder

Institute for Bulgarian Language

Contact Person

Ivelina Stoyanova

text

Monolingual text corpus

Languages

Bulgarian

Linguality

Linguality type: Monolingual

Size

99,000 Tokens

Character encoding

UTF-8

Modalities

Written Language

Annotation

Segmentation

Segmentation level: Sentence

Morphosyntactic Annotation - POS Tagging

Segmentation level: Word

Semantic Annotation - Word Senses

Segmentation level: Word

Lemmatization

Segmentation level: Word

Resource Creation

Resource Creator

Institute for Bulgarian Language

Funding Project

Bulgarian National Corpus project

Funding Type: National Funds

Funding Country: Bulgaria

Project duration: 12/17/2009 - 06/17/2013

Central and South-East European Resources (CESAR)

URL: http://cesar.nytud.hu/

Funding Type: EU Funds

Project duration: 02/01/2011 - 01/30/2013

Metadata

Created: 11/20/2011

Last Updated: 01/31/2013

Version

Version: 3.0

Last Updated: 11/20/2011

Validation

Validated

Usage

Foreseen Use

NLP Applications

Human Use

Actual Use - NLP Applications

Actual Use - Human Use

Documentation

Koeva, Svetla, Svetlozara Leseva, Ekaterina Tarpomanova, Borislav Rizov, Tsvetana Dimitrova and Hristina Kukova. Bulgarian Sense Annotated Corpus - Results and Achievements. In: Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages, Dubrovnik, 2010, pp. 41-49. ISBN 978-953-55375-2-6.

Koeva, Svetla, Svetlozara Leseva and Maria Todorova. Bulgarian Sense Tagged Corpus. In: Proceedings of the 5th SALTMIL Conference on Minority Languages: Strategies for Developing Machine Translation for Minority Languages, Genoa, 2006, pp. 79-87.
