Bilingual text corpus
Languages
Polish
(3,800,000 Tokens)
English
(4,200,000 Tokens)
Linguality
Linguality type: Bilingual
Multi-linguality type: Parallel (The texts were aligned at the sentence level using the statistical aligner Maligna (http://align.sourceforge.net).)
Text Format
text/xml
(8,000,000 Tokens)
Size
Character encoding
UTF-8
(8,000,000 Tokens)
Domains
law_politics
(8,000,000 Tokens)
Modalities
Annotation
Segmentation
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Mixed
Start date: 08/01/2011
End date: 09/30/2011
Size:
8,000,000 Tokens
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic
Start date: 08/01/2011
End date: 09/30/2011
Size:
8,000,000 Tokens
Time Coverage
2004-2011
(8,000,000 Tokens)
Geographic coverage
European Union
(8,000,000 Tokens)
Creation
Creation mode details: Semi-automatic acquisition and processing.
Creation mode: Mixed
Original Sources
Creation Tools

Bilingual text corpus
Languages
Polish
(28,600,000 Tokens)
English
(32,400,000 Tokens)
Linguality
Linguality type: Bilingual
Multi-linguality type: Parallel (The texts were aligned by the JRC (http://optima.jrc.it) at the sentence level using the statistical aligner hunAlign (http://mokk.bme.hu/resources/hunalign/).)
Text Format
text/xml
(61,000,000 Tokens)
Size
Character encoding
UTF-8
(61,000,000 Tokens)
Domains
law_politics
(61,000,000 Tokens)
Modalities
Annotation
Segmentation
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Mixed
Start date: 05/29/2008
End date: 05/29/2008
Size:
61,000,000 Tokens
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic
Start date: 05/29/2008
End date: 05/29/2008
Size:
61,000,000 Tokens
Time Coverage
1958-2006
(61,000,000 Tokens)
Geographic coverage
European Union
(61,000,000 Tokens)
Creation
Creation mode details: Semi-automatic acquisition and processing.
Creation mode: Mixed
Original Sources
Creation Tools

Bilingual text corpus
Languages
Polish
(3,200,000 Tokens)
English
(3,500,000 Tokens)
Linguality
Linguality type: Bilingual
Multi-linguality type: Parallel (The texts were aligned at the sentence level using the statistical aligner Maligna (http://align.sourceforge.net).)
Text Format
text/xml
(6,700,000 Tokens)
Size
Character encoding
UTF-8
(6,700,000 Tokens)
Domains
science
(6,700,000 Tokens)
Modalities
Annotation
Segmentation
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Mixed
Start date: 08/01/2011
End date: 09/30/2011
Size:
6,700,000 Tokens
Alignment
Segmentation level: Sentence
Format: text/xml
Standard practices conformance: TEI
Annotation Mode: Automatic
Start date: 08/01/2011
End date: 09/30/2011
Size:
6,700,000 Tokens
Time Coverage
2003-2011
(6,700,000 Tokens)
Geographic coverage
European Union
(6,700,000 Tokens)
Creation
Creation mode details: Semi-automatic acquisition and processing.
Creation mode: Mixed
Original Sources
Creation Tools