QT21 Domain Specific Human Error-Annotated data set
Training of Automatic Post-editing and Quality Estimation components / Quality Estimation / Error Analysis. Set of 14,000 domain-specific Human Error-Annotated (HEA) quadruplets for 4 language pairs (English to German, English to Czech, English to Latvian, German to English) and 6 translation engines. Each quadruplet consists of (source, reference, HPE, HEA), where HPE is the human post-edit. The domain for En-De and En-Cz is IT; the domain for En-Lv and De-En is Pharma. This HEA data set is based on the HPE data set described in section 3.3.15. A total of 6 translation engines were used to produce the targets that were post-edited: PBMT and NMT from KIT for En-De, PBMT from KIT for De-En, PBMT from CUNI for En-Cz, and both PBMT and NMT systems from Tilde for En-Lv. For each language pair, a single set of source segments was used as input to the different translation engines. From each translation engine, 2,000 target segments were error-annotated. Within each subset of 2,000 HEA segments, 200 were annotated by 2 different professional translators. En-De and De-En HEAs were collected by professional translators from Text & Form. En-Lv HEAs were collected by professional translators from Tilde. En-Cz HEAs were collected by professional translators from Aspena. All data is provided by the EU project QT21 (http://www.qt21.eu/).
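
As a rough illustration of the layout described above, the following Python sketch models one quadruplet and the engine inventory. The description does not specify the distributed file format, so the class name `Quadruplet`, its field names, the `ENGINES` table and the helper `engines_for` are all hypothetical names introduced here; only the language pairs, domains, providers, engine types and per-engine counts are taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quadruplet:
    """One HEA record; field names are assumed, not the dataset's own."""
    source: str     # source-language segment
    reference: str  # reference translation
    hpe: str        # human post-edit of the machine translation
    hea: str        # human error annotation


# (language pair, domain, provider, engine type), as listed in the description.
ENGINES = [
    ("En-De", "IT", "KIT", "PBMT"),
    ("En-De", "IT", "KIT", "NMT"),
    ("De-En", "Pharma", "KIT", "PBMT"),
    ("En-Cz", "IT", "CUNI", "PBMT"),
    ("En-Lv", "Pharma", "Tilde", "PBMT"),
    ("En-Lv", "Pharma", "Tilde", "NMT"),
]

# Per-engine counts stated in the description.
ANNOTATED_SEGMENTS_PER_ENGINE = 2000
DOUBLY_ANNOTATED_PER_ENGINE = 200


def engines_for(language_pair):
    """Return (provider, engine type) tuples for one language pair."""
    return [
        (provider, engine)
        for pair, _domain, provider, engine in ENGINES
        if pair == language_pair
    ]
```

For example, `engines_for("En-Lv")` returns the two Tilde systems, which share one set of source segments according to the description.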