PELCRA multilingual parallel corpora (CC-BY)

PELCRA-PAR-3

http://pelcra.pl/res/parallel/pelcra-par-3

ID: 508

A subset of the PELCRA Polish parallel corpora licensed under the CC-BY license. This resource contains 11,300 texts in 6 languages from the CORDIS website, 5,556 texts in 28 languages from the RAPID site, 3,037 press releases of the European Parliament in 22 languages and 109 press releases of the European Southern Observatory in 17 languages. The texts are sentence-aligned with the mAligna aligner using the Church & Gale algorithm. The texts are provided as TEI P5-compliant XML files with custom PELCRA extensions and in the XLIFF format.
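As an illustration of how the XLIFF distribution might be consumed, the following minimal Python sketch extracts aligned sentence pairs from an XLIFF document. It assumes XLIFF 1.2 with the standard `trans-unit`/`source`/`target` layout; the XLIFF version, the sample content and the function name `read_pairs` are assumptions made for this example and are not taken from the resource itself, so the actual PELCRA files may need adjustments.

```python
import xml.etree.ElementTree as ET

# Assumption: the files follow XLIFF 1.2 and use its default namespace.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def read_pairs(xliff_text):
    """Return (source, target) sentence pairs from an XLIFF 1.2 string."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        src = unit.find("x:source", XLIFF_NS)
        tgt = unit.find("x:target", XLIFF_NS)
        if src is None or tgt is None:
            continue  # skip units without both sides
        pairs.append(("".join(src.itertext()).strip(),
                      "".join(tgt.itertext()).strip()))
    return pairs


# Hypothetical sample document, not an excerpt from the corpus.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="en" target-language="pl" datatype="plaintext" original="sample">
    <body>
      <trans-unit id="1"><source>Good morning.</source><target>Dzień dobry.</target></trans-unit>
    </body>
  </file>
</xliff>"""

if __name__ == "__main__":
    for source, target in read_pairs(SAMPLE):
        print(source, "|", target)
```

The TEI P5 files carry custom PELCRA extensions, so reading them would need knowledge of those extensions that this record does not provide.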

Distribution

Availability

Available - Restricted Use

Start date: 06/30/2012

Licence

CC-BY

Restrictions: Attribution

Download location: hidden

Distribution Access/Medium: Downloadable

Attribution Details: Pęzik P., Ogrodniczuk M., Przepiórkowski A. (2011). Parallel and spoken corpora in an open repository of Polish language resources. Human Language Technologies as a Challenge for Computer Science and Linguistics. LTC Poznań 2011.

Licensors:

Piotr Pęzik

Distribution rights holders:

IPR Holder

Contact Persons

text

Multilingual text corpus

Languages

Slovenian (1,359,000 Words)

Language Script: Latn

Swedish (1,403,000 Words)

Language Script: Latn

Bulgarian (1,070,000 Words)

Language Script: Cyrl

English (1,985,000 Words)

Language Script: Latn

Greek (1,650,000 Words)

Language Script: Grek

Estonian (987,000 Words)

Language Script: Latn

Spanish (1,911,000 Words)

Language Script: Latn

Czech (1,401,000 Words)

Language Script: Latn

German (1,565,000 Words)

Language Script: Latn

Danish (1,256,000 Words)

Language Script: Latn

French (2,152,000 Words)

Language Script: Latn

Finnish (1,069,000 Words)

Language Script: Latn

Italian (2,127,000 Words)

Language Script: Latn

Hungarian (1,205,000 Words)

Language Script: Latn

Latvian (1,127,000 Words)

Language Script: Latn

Lithuanian (1,118,000 Words)

Language Script: Latn

Dutch (1,454,000 Words)

Language Script: Latn

Maltese (1,134,000 Words)

Language Script: Latn

Portuguese (1,725,000 Words)

Language Script: Latn

Polish (1,514,000 Words)

Language Script: Latn

Slovak (1,331,000 Words)

Language Script: Latn

Romanian (1,269,000 Words)

Language Script: Latn

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel (The texts were aligned at the sentence level using the statistical aligner mAligna (http://align.sourceforge.net).)

Text Format

text/xml (60,120 Texts)

Size

31,810,000 Words

60,120 Texts

Character encoding

UTF-8 (60,120 Texts)

Domains

law_politics (60,120 Texts)

Modalities

Written Language (60,120 Texts)

Annotation

Alignment

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI

Theoretic Model: Church & Gale algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)

Annotation Mode: Automatic

Annotation Tools:

mAligna (http://align.sourcef...)

Start date: 06/01/2012

End date: 06/30/2012

Size: 60,120 Texts

Annotators:

Łukasz Dróżdż

Piotr Pęzik
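The length-based model cited in the alignment block above can be sketched in a few lines of Python. This is only an illustration of the published Gale and Church (1993) approach, not the mAligna implementation. The parameter values (`C`, `S2`, `PRIORS`) are the ones reported in the paper for character lengths; using the mean of the two lengths in the variance term follows the paper's appendix program as best recalled and should be treated as an assumption, as should the function names.

```python
import math

# Parameters reported by Gale & Church (1993) for character lengths.
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of that ratio
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def match_cost(len_src, len_tgt, bead):
    """Negative log probability of pairing len_src with len_tgt characters."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal at |delta|.
    tail = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(tail) - math.log(PRIORS[bead])


def align(src_lens, tgt_lens):
    """Dynamic programming over bead types; returns lists of sentence indices."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in PRIORS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + match_cost(sum(src_lens[i:ni]),
                                            sum(tgt_lens[j:nj]), (di, dj))
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


if __name__ == "__main__":
    # Sentence lengths in characters (made-up values).
    print(align([40, 45], [86]))
```

The inputs are sentence lengths in characters, so the sketch needs no lexical resources; that property is what makes the method practical across many language pairs.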

Segmentation

StandOff: False

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI_P5

Annotation Mode: Automatic

Annotation Tools:

LanguageTool (http://languagetool.org)

Start date: 06/01/2012

End date: 06/30/2012

Size: 60,120 Texts

Annotators:

Łukasz Dróżdż

Piotr Pęzik

Time Coverage

2005-2012 (60,120 Texts)

Geographic coverage

European Union (60,120 Texts)

Creation

Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts were used to pipeline text cleanup, segmentation, alignment and import/export procedures.

Creation mode: Mixed

Original Sources

http://www.europarl....

Creation Tools

WebLign (http://code.google.c...)

Multilingual text corpus

Languages

Swedish (3,518,000 Words)

Language Script: Latn

Turkish (5,200 Words)

Language Script: Latn

Romanian (3,196,000 Words)

Language Script: Latn

Russian (2,000 Words)

Language Script: Cyrl

Slovak (3,426,000 Words)

Language Script: Latn

Slovenian (3,463,000 Words)

Language Script: Latn

Dutch (4,229,000 Words)

Language Script: Latn

Norwegian (6,400 Words)

Language Script: Latn

Polish (4,533,000 Words)

Language Script: Latn

Portuguese (4,311,000 Words)

Language Script: Latn

Arabic (1,320 Words)

Language Script: Arab

German (4,698,000 Words)

Language Script: Latn

Danish (3,582,000 Words)

Language Script: Latn

English (4,958,000 Words)

Language Script: Latn

Greek (4,388,000 Words)

Language Script: Grek

Belarusian (311 Words)

Language Script: Cyrl

Czech (3,519,000 Words)

Language Script: Latn

Bulgarian (2,951,000 Words)

Language Script: Cyrl

Estonian (2,794,000 Words)

Language Script: Latn

Spanish (5,234,000 Words)

Language Script: Latn

French (5,627,000 Words)

Language Script: Latn

Finnish (2,691,000 Words)

Language Script: Latn

Croatian (3,300 Words)

Language Script: Latn

Irish (282,000 Words)

Language Script: Latn

Icelandic (2,900 Words)

Language Script: Latn

Hungarian (3,533,000 Words)

Language Script: Latn

Lithuanian (3,069,000 Words)

Language Script: Latn

Italian (4,790,000 Words)

Language Script: Latn

Maltese (3,193,000 Words)

Language Script: Latn

Latvian (2,907,000 Words)

Language Script: Latn

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel (The texts were aligned at the sentence level using the statistical aligner mAligna (http://align.sourceforge.net).)

Text Format

text/xml (88,332 Texts)

Size

84,910,000 Words

88,332 Texts

Character encoding

UTF-8 (88,332 Texts)

Domains

law_politics (88,332 Texts)

Modalities

Written Language (88,332 Texts)

Annotation

Alignment

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI

Theoretic Model: Church & Gale algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)

Annotation Mode: Automatic

Annotation Tools:

mAligna (http://align.sourcef...)

Start date: 06/01/2012

End date: 06/30/2012

Size: 88,332 Texts

Annotators:

Łukasz Dróżdż

Piotr Pęzik

Segmentation

StandOff: False

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI_P5

Annotation Mode: Automatic

Annotation Tools:

LanguageTool (http://languagetool.org)

Start date: 06/01/2012

End date: 06/30/2012

Size: 88,332 Texts

Annotators:

Łukasz Dróżdż

Piotr Pęzik

Time Coverage

2004-2012 (88,332 Texts)

Geographic coverage

European Union (88,332 Texts)

Creation

Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts were used to pipeline text cleanup, segmentation, alignment and import/export procedures.

Creation mode: Mixed

Original Sources

http://europa.eu/rapid/

Creation Tools

WebLign (http://code.google.c...)

Multilingual text corpus

Languages

Czech (107,000 Words)

Language Script: Latn

Finnish (78,000 Words)

Language Script: Latn

Spanish (129,000 Words)

Language Script: Latn

Icelandic (99,000 Words)

Language Script: Latn

French (134,000 Words)

Language Script: Latn

German (125,000 Words)

Language Script: Latn

English (114,000 Words)

Language Script: Latn

Danish (117,000 Words)

Language Script: Latn

Dutch (115,000 Words)

Language Script: Latn

Italian (119,000 Words)

Language Script: Latn

Polish (104,000 Words)

Language Script: Latn

Norwegian (115,000 Words)

Language Script: Latn

Russian (52,000 Words)

Language Script: Cyrl

Portuguese (136,000 Words)

Language Script: Latn

Turkish (83,000 Words)

Language Script: Latn

Swedish (116,000 Words)

Language Script: Latn

Ukrainian (71,000 Words)

Language Script: Cyrl

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel (The texts were aligned at the sentence level using the statistical aligner mAligna (http://align.sourceforge.net).)

Text Format

text/xml (1,728 Texts)

Size

1,814,000 Words

1,728 Texts

Character encoding

UTF-8 (1,728 Texts)

Domains

science (1,728 Texts)

Modalities

Written Language (1,728 Texts)

Annotation

Alignment

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI

Theoretic Model: Church & Gale algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)

Annotation Mode: Automatic

Annotation Tools:

mAligna (http://align.sourcef...)

Start date: 06/01/2012

End date: 06/30/2012

Size: 1,728 Texts

Annotators:

Łukasz Dróżdż

Piotr Pęzik

Segmentation

StandOff: False

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI_P5

Annotation Mode: Automatic

Annotation Tools:

LanguageTool (http://languagetool.org)

Start date: 06/01/2012

End date: 06/30/2012

Size: 1,728 Texts

Annotators:

Łukasz Dróżdż

Piotr Pęzik

Time Coverage

2009-2012 (1,728 Texts)

Geographic coverage

European Union (1,728 Texts)

Creation

Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts were used to pipeline text cleanup, segmentation, alignment and import/export procedures.

Creation mode: Mixed

Original Sources

http://www.eso.org

Creation Tools

WebLign (http://code.google.c...)

Multilingual text corpus

Languages

Italian (4,247,000 Words)

Language Script: Latn

English (3,907,000 Words)

Language Script: Latn

German (3,788,000 Words)

Language Script: Latn

French (4,456,000 Words)

Language Script: Latn

Spanish (4,558,000 Words)

Language Script: Latn

Polish (3,581,000 Words)

Language Script: Latn

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel (The texts were aligned at the sentence level using the statistical aligner mAligna (http://align.sourceforge.net).)

Text Format

text/xml (67,787 Texts)

Size

24,539,000 Words

67,787 Texts

Character encoding

UTF-8 (67,787 Texts)

Domains

science (67,787 Texts)

Modalities

Written Language (67,787 Texts)

Annotation

Alignment

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI

Theoretic Model: Church & Gale algorithm (Gale, William A.; Church, Kenneth W. (1993), "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19 (1): 75–102)

Annotation Mode: Automatic

Annotation Tools:

mAligna (http://align.sourcef...)

Start date: 08/01/2011

End date: 06/30/2012

Size: 67,787 Texts

Annotators:

Łukasz Dróżdż

Piotr Pęzik

Segmentation

StandOff: False

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI_P5

Annotation Mode: Automatic

Annotation Tools:

LanguageTool (http://languagetool.org)

Start date: 08/01/2011

End date: 06/30/2012

Size: 67,787 Texts

Annotators:

Łukasz Dróżdż

Piotr Pęzik

Time Coverage

2003-2012 (67,787 Texts)

Geographic coverage

European Union (67,787 Texts)

Creation

Creation mode details: The texts were acquired using a custom-built web crawler. Semi-automatic scripts were used to pipeline text cleanup, segmentation, alignment and import/export procedures.

Creation mode: Mixed

Original Sources