QT21 Domain Specific Human Post-Edited data set



Intended for training Automatic Post-Editing and Quality Estimation components, and for error analysis.

Set of 165,000 domain-specific Human Post-Edited (HPE) triplets for 4 language pairs (English to German, English to Czech, English to Latvian, German to English) and 6 translation engines. Each triplet consists of (source, reference, HPE). The domain for En-De and En-Cz is IT; the domain for En-Lv and De-En is Pharma.

A total of 6 translation engines were used to produce the targets that were post-edited: PBMT and NMT from KIT for En-De, PBMT from KIT for De-En, PBMT from CUNI for En-Cz, and both PBMT and NMT systems from Tilde for En-Lv. For each language pair, a single set of source segments was used as input to the different translation engines. Each translation engine provided 30,000 target segments, except for the two En-Lv engines, which provided 22,500 target segments each.

En-De and De-En HPEs were collected by professional translators from Text&Form. En-Lv HPEs were collected by professional translators from Tilde. En-Cz HPEs were collected by professional translators from Traductera. All data is provided by the EU project QT21 (http://www.qt21.eu/).
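
The segment counts stated above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the data set or its tooling: the dictionary keys and function names are illustrative labels chosen for this example, and only the numbers are taken from the description.

```python
# Target segments per translation engine, as stated in the description.
# Keys are (language pair, engine type, provider); these labels are
# illustrative and are not official identifiers from the data set.
SEGMENTS_PER_ENGINE = {
    ("En-De", "PBMT", "KIT"): 30_000,
    ("En-De", "NMT", "KIT"): 30_000,
    ("De-En", "PBMT", "KIT"): 30_000,
    ("En-Cz", "PBMT", "CUNI"): 30_000,
    ("En-Lv", "PBMT", "Tilde"): 22_500,
    ("En-Lv", "NMT", "Tilde"): 22_500,
}


def total_triplets(counts):
    """Sum the per-engine segment counts into one overall triplet count."""
    return sum(counts.values())


def triplets_per_language_pair(counts):
    """Aggregate the per-engine counts by language pair."""
    totals = {}
    for (pair, _engine, _provider), n in counts.items():
        totals[pair] = totals.get(pair, 0) + n
    return totals


if __name__ == "__main__":
    print(total_triplets(SEGMENTS_PER_ENGINE))
    print(triplets_per_language_pair(SEGMENTS_PER_ENGINE))
```

Four engines at 30,000 segments plus two engines at 22,500 segments give 120,000 + 45,000 = 165,000, which matches the stated size of the set.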

IMPORTANT LEGAL NOTICE (This dataset is provided under the following terms of use)
TAUS Terms of Use (https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21).
TAUS grants to QT21 User access to the WMT Data Set with the following rights:
i) the right to use the target side of the translation units into a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is its own new translation;
ii) the right to make Derivative Works; and
iii) the right to use or resell such Derivative Works commercially and for the following goals:
    i) research and benchmarking;
    ii) piloting new solutions; and
    iii) testing of new commercial services.

