QT21 Domain Specific Human Error-Annotated data set



Intended uses: training of Automatic Post-Editing and Quality Estimation components, quality estimation, and error analysis.
A set of 10,800 domain-specific Human Error-Annotated (HEA) tuples covering four language pairs and six translation engines. The set consists of 8,800 quadruplets of the form (source, target, HPE, HEA) and 2,000 quintuplets that contain a second HEA produced by a different annotator, i.e. (source, target, HPE, HEA1, HEA2). The domain for En-De and En-Cz is IT; the domain for En-Lv and De-En is Pharma. This HEA data set is based on the human post-edits (HPE) in the QT21 Domain-Specific Human Post-Edited data set.

Six translation engines were used to produce the post-edited targets: PBMT from KIT and NMT (Nematus) for En-De, PBMT from KIT for De-En, PBMT from CUNI for En-Cz, and both PBMT and NMT systems from Tilde for En-Lv. For each language pair, a single set of source segments was used as input to the different translation engines. From each translation engine, 1,800 target segments were error-annotated. Of each subset of 1,800 HEA segments, 400 were annotated by two different professional translators, except for the two En-Lv engines, for which only 200 each were doubly annotated.

The En-De and De-En HEAs were produced by professional translators from Text & Form, the En-Lv HEAs by professional translators from Tilde, and the En-Cz HEAs by professional translators from Aspena.
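The totals quoted in the description can be cross-checked from the per-engine figures. The following Python snippet is only an illustrative sketch: the dictionary keys are labels chosen here for readability, not identifiers or file names from the data set, and the counts are those stated in the description.

```python
# Cross-check of the segment counts stated in the description.
# Keys are illustrative labels (assumptions), not names used in the data set.
SEGMENTS_PER_ENGINE = 1800

# Number of segments per engine that carry a second annotation (HEA2).
double_annotated = {
    ("En-De", "PBMT, KIT"): 400,
    ("En-De", "NMT, Nematus"): 400,
    ("De-En", "PBMT, KIT"): 400,
    ("En-Cz", "PBMT, CUNI"): 400,
    ("En-Lv", "PBMT, Tilde"): 200,
    ("En-Lv", "NMT, Tilde"): 200,
}

# Every engine contributes the same number of error-annotated segments.
total = SEGMENTS_PER_ENGINE * len(double_annotated)

# Doubly annotated segments form quintuplets; the rest are quadruplets.
quintuplets = sum(double_annotated.values())
quadruplets = total - quintuplets

# Error-annotated segments per language pair.
per_pair = {}
for (pair, _engine) in double_annotated:
    per_pair[pair] = per_pair.get(pair, 0) + SEGMENTS_PER_ENGINE

print(total, quadruplets, quintuplets)
print(per_pair)
```

Under these figures, the six engines yield the stated 10,800 tuples, split into 8,800 quadruplets and 2,000 quintuplets, with En-De and En-Lv each contributing 3,600 segments because two engines were used for those pairs.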

