WMT 2018 Automatic Post-Editing data set
For the APE shared task at WMT 2018**, we will use:
- A new test set of 2,000 segments for the English-German language pair from 2017 (en-de and de-en), where the MT segments are generated by the SMT system. In total, this language pair covers 30k segments. The split is: training set = 23k, dev set = 1k, test-set16 = 2k, test-set17 = 2k, test-set18 = 2k.
- A new English-German dataset of 30,000 segments where the MT segments are generated by an NMT system. The split is: training set = 27k, dev set = 1k, test-set18 = 1k.
The SMT English-German test data consists of 2,000 triplets (source, target and post-edit) belonging to the Information Technology domain and already tokenised. Target sentences are machine-translated with the KIT SMT system. Post-edits are collected by Text & Form from professional translators.
The NMT English-German data consists of 30,000 triplets (source, target and post-edit) belonging to the Information Technology domain and already tokenised. Target sentences are machine-translated with the Nematus system. Post-edits are collected by Text & Form from professional translators.
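As an illustration of the triplet structure described above, here is a minimal sketch of how such data could be loaded. It assumes the triplets are stored as three line-aligned plain-text files (source, MT target, post-edit) with one tokenised segment per line; that layout, the file paths, and the names `Triplet` and `read_triplets` are assumptions made for this example, not something this description specifies.

```python
from typing import Iterator, List, NamedTuple


class Triplet(NamedTuple):
    """One APE example: tokenised source, MT output, and human post-edit."""
    source: List[str]
    target: List[str]
    post_edit: List[str]


def _read_lines(path: str) -> List[str]:
    # One segment per line is an assumption about the file layout.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_triplets(src_path: str, mt_path: str, pe_path: str) -> Iterator[Triplet]:
    """Yield aligned (source, target, post-edit) triplets from three files."""
    columns = [_read_lines(p) for p in (src_path, mt_path, pe_path)]
    # The three files must have the same number of lines to stay aligned.
    if len({len(column) for column in columns}) != 1:
        raise ValueError("source, target and post-edit files differ in length")
    for src, mt, pe in zip(*columns):
        # The data is already tokenised, so whitespace splitting is enough.
        yield Triplet(src.split(), mt.split(), pe.split())
```

Checking the line counts before yielding anything guards against silently misaligned triplets, which would otherwise pair a source sentence with the wrong post-edit.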
** The 2018 APE data sets will be made available at the end of June 2018!
TAUS grants to QT21 User access to the WMT Data Set with the following rights:
i) the right to use the target side of the translation units into a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is its own new translation;
ii) the right to make Derivative Works; and
iii) the right to use or resell such Derivative Works commercially and for the following goals:
   i) research and benchmarking;
   ii) piloting new solutions; and
   iii) testing of new commercial services.