WMT 2017 Automatic Post-editing and Quality Estimation data set
For WMT 2017, 11,000 segments have been added to the WMT16 training set (En-De), together with a new test set for 2017 made of 2,000 segments (En-De). In 2017 a new language pair has also been added: De-En, with 25k segments for training, 1k segments for dev and 2k segments for test.

Adding the 2016 and 2017 APE and QE data together, we obtain a total of 28k segments for each language pair, split as follows:

En-De: training set = 23k, dev set = 1k, test-set16 = 2k, test-set17 = 2k
De-En: training set = 25k, dev set = 1k, test-set17 = 2k

En-De: training, development and test data consist of English-German triplets (source, target and post-edit) belonging to the Information Technology domain and already tokenised. Target sentences are machine-translated with the KIT system. Post-edits are collected by Text & Form from professional translators.

De-En: training, development and test data consist of German-English triplets (source, target and post-edit) belonging to the Pharma domain and already tokenised. Target sentences are machine-translated with the KIT system. Post-edits are collected by Text & Form from professional translators.
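
Since each segment is a (source, target, post-edit) triplet, one way to load a split is to read three line-aligned files in parallel. The following is a minimal sketch under assumptions not stated in this description: it supposes each split ships as three UTF-8 text files with one segment per line, and the function name `read_triplets` and its path arguments are hypothetical, not part of the release.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    source: str     # source-language sentence
    target: str     # machine-translated sentence
    post_edit: str  # human post-edit of the target


def read_triplets(src_path: str, mt_path: str, pe_path: str) -> List[Triplet]:
    """Read three line-aligned files into a list of triplets.

    Assumes (hypothetically) one segment per line in each file.
    """
    columns = []
    for path in (src_path, mt_path, pe_path):
        with open(path, encoding="utf-8") as handle:
            # Strip only the trailing newline; the text is already tokenised.
            columns.append([line.rstrip("\n") for line in handle])

    # All three files must contain the same number of segments.
    sizes = [len(column) for column in columns]
    if len(set(sizes)) != 1:
        raise ValueError(f"files are not line-aligned: {sizes}")

    return [Triplet(s, t, p) for s, t, p in zip(*columns)]
```

Because the data are already tokenised, calling `.split()` on any field of a triplet should yield its tokens directly, with no further tokenisation step.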