PELCRA Polish-English parallel corpora (CC-BY)



PELCRA-PAR-1

http://pelcra.pl/res/parallel/pelcra-par-1

ID:

501 A subset of the PELCRA Polish parallel corpora licensed under the CC-BY license. This resource contains 10268 texts from the CORDIS website, 23319 texts from the JRC-Acquis and 4740 texts from the RAPID site. Individual headers may override the licensing information.
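As a quick sanity check on the description above, the per-source text counts can be totalled and expressed as shares. This is only a minimal sketch: the dictionary and the `summarize` helper are illustrative names, not part of the resource, and the figures are copied from the description rather than computed from the corpus files.

```python
# Text counts per source, copied from the resource description above.
texts_per_source = {
    "CORDIS": 10268,
    "JRC-Acquis": 23319,
    "RAPID": 4740,
}


def summarize(counts):
    """Return the total and each source's share of it (as a fraction)."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total_texts, shares = summarize(texts_per_source)
print(f"total texts: {total_texts}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Shares by text count say nothing about token volume, since texts from different sources may differ considerably in length.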


Distribution

Availability

Available - Restricted Use

Start date: 11/30/2011

Licence

CC-BY

Restrictions: Attribution

Download location: hidden

Distribution Access/Medium: Downloadable

Licensors:

Piotr Pęzik

Distribution rights holders:

IPR Holder

Contact Persons

text

Bilingual text corpus

Languages

Polish (3,800,000 Tokens) English (4,200,000 Tokens)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel (The texts were aligned at the sentence level using the statistical aligner Maligna (http://align.sourceforge.net).)

Text Format

text/xml (8,000,000 Tokens)

Size

8,000,000 Tokens

Character encoding

UTF-8 (8,000,000 Tokens)

Domains

law_politics (8,000,000 Tokens)

Modalities

Other

Annotation

Segmentation

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Mixed

Annotation Tools:

in-house software

Start date: 08/01/2011

End date: 09/30/2011

Size: 8,000,000 Tokens

Annotators:

Dróżdż Łukasz

Piotr Pęzik

Alignment

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic

Annotation Tools:

Maligna

Start date: 08/01/2011

End date: 09/30/2011

Size: 8,000,000 Tokens

Annotators:

Dróżdż Łukasz

Piotr Pęzik

Time Coverage

2004-2011 (8,000,000 Tokens)

Geographic coverage

European Union (8,000,000 Tokens)

Creation

Creation mode details: Semi-automatic acquisition and processing.

Creation mode: Mixed

Original Sources

http://europa.eu/rapid/

Creation Tools

in-house software

Bilingual text corpus

Languages

Polish (28,600,000 Tokens) English (32,400,000 Tokens)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel (The texts were aligned by the JRC (http://optima.jrc.it) at the sentence level using the statistical aligner hunAlign (http://mokk.bme.hu/resources/hunalign/).)

Text Format

text/xml (61,000,000 Tokens)

Size

61,000,000 Tokens

Character encoding

UTF-8 (61,000,000 Tokens)

Domains

law_politics (61,000,000 Tokens)

Modalities

Other

Annotation

Segmentation

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Mixed

Annotation Tools:

unknown

Start date: 05/29/2008

End date: 05/29/2008

Size: 61,000,000 Tokens

Annotators:

Ralf Steinberger

Alignment

Segmentation level: Sentence

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic

Annotation Tools:

hunAlign

Start date: 05/29/2008

End date: 05/29/2008

Size: 61,000,000 Tokens

Annotators:

Ralf Steinberger

Time Coverage

1958-2006 (61,000,000 Tokens)

Geographic coverage

European Union (61,000,000 Tokens)

Creation

Creation mode details: Semi-automatic acquisition and processing.

Creation mode: Mixed

Original Sources

http://optima.jrc.it...

Creation Tools

in-house software

Bilingual text corpus

Languages

Polish (3,200,000 Tokens) English (3,500,000 Tokens)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel (The texts were aligned at the sentence level using the statistical aligner Maligna (http://align.sourceforge.net).)

Text Format

text/xml (6,700,000 Tokens)

Size

6,700,000 Tokens

Character encoding

UTF-8 (6,700,000 Tokens)

Domains

science (6,700,000 Tokens)