Multilingual text corpus
Languages
Slovenian
(1,359,000 Words)
Language Script: Latn
Swedish
(1,403,000 Words)
Language Script: Latn
Bulgarian
(1,070,000 Words)
Language Script: Cyrl
English
(1,985,000 Words)
Language Script: Latn
Greek
(1,650,000 Words)
Language Script: Grek
Estonian
(987,000 Words)
Language Script: Latn
Spanish
(1,911,000 Words)
Language Script: Latn
Czech
(1,401,000 Words)
Language Script: Latn
German
(1,565,000 Words)
Language Script: Latn
Danish
(1,256,000 Words)
Language Script: Latn
French
(2,152,000 Words)
Language Script: Latn
Finnish
(1,069,000 Words)
Language Script: Latn
Italian
(2,127,000 Words)
Language Script: Latn
Hungarian
(1,205,000 Words)
Language Script: Latn
Latvian
(1,127,000 Words)
Language Script: Latn
Lithuanian
(1,118,000 Words)
Language Script: Latn
Dutch
(1,454,000 Words)
Language Script: Latn
Maltese
(1,134,000 Words)
Language Script: Latn
Portuguese
(1,725,000 Words)
Language Script: Latn
Polish
(1,514,000 Words)
Language Script: Latn
Slovak
(1,331,000 Words)
Language Script: Latn
Romanian
(1,269,000 Words)
Language Script: Latn
Linguality
Linguality type: Multilingual
Multi-linguality type: Parallel (the texts were aligned at the sentence level using the statistical aligner Maligna, http://align.sourceforge.net)
Text Format
Size
31,810,000 Words
60,120 Texts
Character encoding
UTF-8
(60,120 Texts)
Domains
law_politics
(60,120 Texts)
Modalities
Written Language
(60,120 Texts)
Annotation
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Theoretic Model: Church & Gale algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)
Annotation Mode: Automatic
Start date: 06/01/2012
End date: 06/30/2012
Size:
60,120 Texts
Segmentation
StandOff: False
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI_P5
Annotation Mode: Automatic
Start date: 06/01/2012
End date: 06/30/2012
Size:
60,120 Texts
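The length-based alignment model cited above can be illustrated with a minimal sketch. It is not Maligna's implementation and is not taken from this record: the function names `match_cost` and `align` are hypothetical, and the constants (expected length ratio 1, variance 6.8, and the prior probabilities per alignment type) are the values commonly quoted from Gale & Church (1993), used here as assumptions. The sketch scores candidate groupings of sentences by how well their character lengths match and picks the cheapest path by dynamic programming.

```python
import math

# Prior probability of each alignment type (source sentences, target sentences).
# Values commonly quoted from Gale & Church (1993); treated here as assumptions.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of that ratio


def match_cost(len1, len2, kind):
    """Negative log-probability of pairing len1 source chars with len2 target chars."""
    prior_cost = -math.log(PRIORS[kind])
    if len1 == 0 and len2 == 0:
        return prior_cost
    mean = (len1 + len2 / C) / 2.0
    delta = (len2 - len1 * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal deviate, floored to avoid log(0).
    p_delta = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(p_delta) + prior_cost


def align(src_lens, tgt_lens):
    """Return the cheapest alignment as a list of (source indices, target indices)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in PRIORS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), (di, dj))
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the aligned groups.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    # Sentence lengths in characters; two source sentences merge into one target sentence.
    print(align([40, 45, 30], [88, 29]))
```

The inputs are sentence lengths in characters, not the sentences themselves, which is what makes this family of aligners language-independent and cheap to run across many language pairs.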
Time Coverage
2005-2012
(60,120 Texts)
Geographic coverage
European Union
(60,120 Texts)
Creation
Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts were used to pipeline text cleanup, segmentation, alignment and import/export procedures.
Creation mode: Mixed
Original Sources
Creation Tools

Multilingual text corpus
Languages
Swedish
(3,518,000 Words)
Language Script: Latn
Turkish
(5,200 Words)
Language Script: Latn
Romanian
(3,196,000 Words)
Language Script: Latn
Russian
(2,000 Words)
Language Script: Cyrl
Slovak
(3,426,000 Words)
Language Script: Latn
Slovenian
(3,463,000 Words)
Language Script: Latn
Dutch
(4,229,000 Words)
Language Script: Latn
Norwegian
(6,400 Words)
Language Script: Latn
Polish
(4,533,000 Words)
Language Script: Latn
Portuguese
(4,311,000 Words)
Language Script: Latn
Arabic
(1,320 Words)
Language Script: Arab
German
(4,698,000 Words)
Language Script: Latn
Danish
(3,582,000 Words)
Language Script: Latn
English
(4,958,000 Words)
Language Script: Latn
Greek
(4,388,000 Words)
Language Script: Grek
Belarusian
(311 Words)
Language Script: Cyrl
Czech
(3,519,000 Words)
Language Script: Latn
Bulgarian
(2,951,000 Words)
Language Script: Cyrl
Estonian
(2,794,000 Words)
Language Script: Latn
Spanish
(5,234,000 Words)
Language Script: Latn
French
(5,627,000 Words)
Language Script: Latn
Finnish
(2,691,000 Words)
Language Script: Latn
Croatian
(3,300 Words)
Language Script: Latn
Irish
(282,000 Words)
Language Script: Latn
Icelandic
(2,900 Words)
Language Script: Latn
Hungarian
(3,533,000 Words)
Language Script: Latn
Lithuanian
(3,069,000 Words)
Language Script: Latn
Italian
(4,790,000 Words)
Language Script: Latn
Maltese
(3,193,000 Words)
Language Script: Latn
Latvian
(2,907,000 Words)
Language Script: Latn
Linguality
Linguality type: Multilingual
Multi-linguality type: Parallel (the texts were aligned at the sentence level using the statistical aligner Maligna, http://align.sourceforge.net)
Text Format
Size
84,910,000 Words
88,332 Texts
Character encoding
UTF-8
(88,332 Texts)
Domains
law_politics
(88,332 Texts)
Modalities
Written Language
(88,332 Texts)
Annotation
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Theoretic Model: Church & Gale algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)
Annotation Mode: Automatic
Start date: 06/01/2012
End date: 06/30/2012
Size:
88,332 Texts
Segmentation
StandOff: False
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI_P5
Annotation Mode: Automatic
Start date: 06/01/2012
End date: 06/30/2012
Size:
88,332 Texts
Time Coverage
2004-2012
(88,332 Texts)
Geographic coverage
European Union
(88,332 Texts)
Creation
Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts were used to pipeline text cleanup, segmentation, alignment and import/export procedures.
Creation mode: Mixed
Original Sources
Creation Tools

Multilingual text corpus
Languages
Czech
(107,000 Words)
Language Script: Latn
Finnish
(78,000 Words)
Language Script: Latn
Spanish
(129,000 Words)
Language Script: Latn
Icelandic
(99,000 Words)
Language Script: Latn
French
(134,000 Words)
Language Script: Latn
German
(125,000 Words)
Language Script: Latn
English
(114,000 Words)
Language Script: Latn
Danish
(117,000 Words)
Language Script: Latn
Dutch
(115,000 Words)
Language Script: Latn
Italian
(119,000 Words)
Language Script: Latn
Polish
(104,000 Words)
Language Script: Latn
Norwegian
(115,000 Words)
Language Script: Latn
Russian
(52,000 Words)
Language Script: Cyrl
Portuguese
(136,000 Words)
Language Script: Latn
Turkish
(83,000 Words)
Language Script: Latn
Swedish
(116,000 Words)
Language Script: Latn
Ukrainian
(71,000 Words)
Language Script: Cyrl
Linguality
Linguality type: Multilingual
Multi-linguality type: Parallel (the texts were aligned at the sentence level using the statistical aligner Maligna, http://align.sourceforge.net)
Text Format
Size
1,814,000 Words
1,728 Texts
Character encoding
UTF-8
(1,728 Texts)
Domains
Modalities
Written Language
(1,728 Texts)
Annotation
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Theoretic Model: Church & Gale algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)
Annotation Mode: Automatic
Start date: 06/01/2012
End date: 06/30/2012
Size:
1,728 Texts
Segmentation
StandOff: False
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI_P5
Annotation Mode: Automatic
Start date: 06/01/2012
End date: 06/30/2012
Size:
1,728 Texts
Time Coverage
2009-2012
(1,728 Texts)
Geographic coverage
European Union
(1,728 Texts)
Creation
Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts were used to pipeline text cleanup, segmentation, alignment and import/export procedures.
Creation mode: Mixed
Original Sources
Creation Tools

Multilingual text corpus
Languages
Italian
(4,247,000 Words)
Language Script: Latn
English
(3,907,000 Words)
Language Script: Latn
German
(3,788,000 Words)
Language Script: Latn
French
(4,456,000 Words)
Language Script: Latn
Spanish
(4,558,000 Words)
Language Script: Latn
Polish
(3,581,000 Words)
Language Script: Latn
Linguality
Linguality type: Multilingual
Multi-linguality type: Parallel (the texts were aligned at the sentence level using the statistical aligner Maligna, http://align.sourceforge.net)
Text Format
Size
24,539,000 Words
67,787 Texts
Character encoding
UTF-8
(67,787 Texts)
Domains
Modalities
Written Language
(67,787 Texts)
Annotation
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Theoretic Model: Church & Gale algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)
Annotation Mode: Automatic
Start date: 08/01/2011
End date: 06/30/2012
Size:
67,787 Texts
Segmentation
StandOff: False
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI_P5
Annotation Mode: Automatic
Start date: 08/01/2011
End date: 06/30/2012
Size:
67,787 Texts
Time Coverage
2003-2012
(67,787 Texts)
Geographic coverage
European Union
(67,787 Texts)
Creation
Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts were used to pipeline text cleanup, segmentation, alignment and import/export procedures.
Creation mode: Mixed
Original Sources
Creation Tools