CESAR Aligned Wikipedia Headwords List




The lexicon contains 762,662 entries built from the Wikipedia dumps of the six CESAR languages, using article titles and interlingual links to English and to the other five CESAR languages. In the first phase, a separate lexicon is built for each CESAR language; these lexicons are then merged by grouping together all entries that are connected by interlingual links. If more than one article in a language is connected to a group of articles in other languages (which is actually an error in the structure of the Wikipedias), all article titles are retained, separated by semicolons. An example of such an entry is "Астеци; Империја Астека". In the final phase, category information from the English Wikipedia is added, with categories separated by semicolons, and for each non-English entry the number of links to that page in the Wikipedia of the respective language is given.
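
The merging phase can be read as finding connected components in a graph whose nodes are (language, title) pairs and whose edges are interlingual links. The following is only a minimal sketch of that reading, not the procedure actually used to build the resource: the function name merge_lexicons, the language codes, the input format and the English title "Aztecs" are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_lexicons(links):
    """Group (language, title) pairs connected by interlingual links.

    `links` is an iterable of ((lang, title), (lang, title)) pairs
    (a hypothetical input format). Returns a list of entries, each a dict
    mapping a language code to its titles joined by "; ".
    """
    parent = {}

    def find(node):
        # Locate the representative of the node's group.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in links:
        union(a, b)

    # Collect the titles of each group, keyed by language.
    groups = defaultdict(lambda: defaultdict(list))
    for node in list(parent):
        lang, title = node
        groups[find(node)][lang].append(title)

    # Several titles of one language in the same group are all retained.
    return [
        {lang: "; ".join(sorted(titles)) for lang, titles in by_lang.items()}
        for by_lang in groups.values()
    ]


# Hypothetical links: two Serbian articles point to the same English article.
links = [
    (("en", "Aztecs"), ("sr", "Астеци")),
    (("en", "Aztecs"), ("sr", "Империја Астека")),
    (("en", "Zagreb"), ("hr", "Zagreb")),
]
entries = merge_lexicons(links)
```

In this sketch the two Serbian titles end up in the same group and are joined with a semicolon, as in the example entry quoted above. Category information and link counts from the final phase are not modelled here.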

  • Headword lists from the Bulgarian, Croatian, Hungarian, Polish, Serbian and Slovak Wikipedias, aligned with English through interlingual links between Wikipedia articles
