Polish Sejm Corpus




The Polish Sejm Corpus contains annotated utterances of Polish Sejm members from terms of office 1-6 (years 1991-2011). Corpus files contain information about text segmentation (paragraphs, sentences, tokens), disambiguated morphosyntactic description (lemma, POS tag, MSD tag), syntactic description (syntactic words and groups) and named entities (person names, locations, organizations).
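
The annotation layers listed above can be pictured as a small in-memory data model. The following is only a minimal sketch: the class names (`Token`, `NamedEntity`, `Sentence`, `Paragraph`), the field names, and the sample tag values are assumptions made for illustration and do not reflect the corpus file format, which is not described here. The syntactic layer (syntactic words and groups) is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    orth: str   # surface form as it appears in the text
    lemma: str  # disambiguated base form
    pos: str    # part-of-speech tag
    msd: str    # morphosyntactic description tag


@dataclass
class NamedEntity:
    kind: str   # e.g. "person", "location", "organization"
    start: int  # index of the first token of the entity in the sentence
    end: int    # index one past the last token of the entity


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)
    entities: List[NamedEntity] = field(default_factory=list)

    def entity_text(self, entity: NamedEntity) -> str:
        """Return the surface text covered by a named entity."""
        return " ".join(t.orth for t in self.tokens[entity.start:entity.end])


@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)


# Hypothetical sample sentence; the tag values are illustrative only
# and were not taken from the corpus.
sentence = Sentence(
    tokens=[
        Token("Sejm", "sejm", "subst", "sg:nom:m3"),
        Token("uchwalił", "uchwalić", "praet", "sg:m3:perf"),
        Token("ustawę", "ustawa", "subst", "sg:acc:f"),
    ],
    entities=[NamedEntity("organization", 0, 1)],
)
paragraph = Paragraph(sentences=[sentence])

lemmas = [t.lemma for s in paragraph.sentences for t in s.tokens]
first_entity = sentence.entity_text(sentence.entities[0])
```

Storing entities as token index ranges rather than copied strings keeps the named-entity layer aligned with the token layer, which is one common way to represent stand-off annotation.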

The data is a valuable source of linguistic information: it is a large collection (100 million segments) of quasi-spoken content, and it forms the basis for the audio/video recordings of sessions, which started in 2011 and are planned to be added to the corpus successively.

  • Spejd, a shallow parser of Polish
  • Pantera, a Brill tagger for Polish
  • Morfeusz SGJP, a tokenizer, morphological analyzer and lemmatizer for Polish
  • Nerf, a named entity recognizer for Polish
