GENIA POS & Term Corpus



A corpus of 2,000 MEDLINE abstracts, collected using the three MeSH terms “human”, “blood cells” and “transcription factors”. The corpus is available in three formats: 1) a text file containing part-of-speech (POS) annotation, based on the Penn Treebank format; 2) an XML file containing inline POS annotation; and 3) a “merged” XML file containing inline annotations for both POS and terms.
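
As a rough illustration of working with the text format, the sketch below reads POS-tagged text into (word, tag) pairs. It assumes that tokens are whitespace-separated and written as word/TAG, which is a common convention for Penn Treebank style tagged text but is not confirmed by this description; the function name parse_tagged_tokens and the sample string are hypothetical and are not taken from the corpus files.

```python
def parse_tagged_tokens(text):
    """Split whitespace-separated word/TAG tokens into (word, tag) pairs.

    Assumption: each token has the form word/TAG. Check the actual
    corpus files before relying on this layout.
    """
    pairs = []
    for token in text.split():
        # Split on the last slash only, since a word may itself contain "/".
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            # Skip anything that is not a word/TAG token (e.g. separator lines).
            continue
        pairs.append((word, tag))
    return pairs


# Hypothetical sample in the assumed format, not copied from the corpus.
sample = "IL-2/NN gene/NN expression/NN requires/VBZ NF-kappa/NN B/NN ./."
for word, tag in parse_tagged_tokens(sample):
    print(word, tag)
```

Splitting on the last slash keeps tokens such as a/b/NN intact as the word a/b with the tag NN; the XML formats would need an XML parser instead.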

