These are the data sets for the machine translation (MT) tasks of the IWSLT evaluation campaigns. They are parallel data sets used for building and testing MT systems. They are publicly available through the WIT3 website (wit3.fbk.eu); see release 2017-01.
IWSLT 2017:
• multilingual: German, English, Italian, Dutch, Romanian
• bilingual: English to and from each of Arabic, German, French, Japanese, Korean, Chinese
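As a rough illustration of the language coverage listed above, the following sketch enumerates translation directions. It assumes that the multilingual task covers every ordered pair of its five languages, which the list does not state explicitly; the variable names are illustrative only and do not come from the WIT3 release.

```python
from itertools import permutations

# Languages as listed for IWSLT 2017.
multilingual_langs = ["German", "English", "Italian", "Dutch", "Romanian"]
bilingual_partners = ["Arabic", "German", "French", "Japanese", "Korean", "Chinese"]

# ASSUMPTION: every ordered (source, target) pair of the five
# multilingual languages counts as one translation direction.
multilingual_directions = list(permutations(multilingual_langs, 2))

# Bilingual data: English paired with each partner, in both directions.
bilingual_directions = []
for lang in bilingual_partners:
    bilingual_directions.append(("English", lang))
    bilingual_directions.append((lang, "English"))

print(len(multilingual_directions), "multilingual directions")
print(len(bilingual_directions), "bilingual directions")
```

Under that assumption, the enumeration gives 20 multilingual directions and 12 bilingual directions.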
Data are crawled from the TED website and carry the respective licensing conditions. For each language pair, training sets include approximately 2,000 talks, 200K sentences and 4M tokens per side, while each dev and test set includes 10-15 talks, 1.0K-1.5K sentences and 20K-30K tokens per side. In each edition, the training sets of previous editions are re-used and updated with the talks added to the TED repository in the meantime.
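The approximate sizes quoted above imply some simple averages, which can serve as a sanity check when loading the data. The sketch below only does arithmetic on those rounded figures; pairing the low ends and the high ends of the dev/test ranges with each other is an assumption made for illustration, and the helper name `averages` is hypothetical.

```python
def averages(talks, sentences, tokens):
    """Return (sentences per talk, tokens per sentence) for one side."""
    return sentences / talks, tokens / sentences

# Approximate per-side training figures quoted in the description.
train_avg = averages(talks=2_000, sentences=200_000, tokens=4_000_000)

# ASSUMPTION: the low ends of the dev/test ranges belong together,
# and likewise the high ends.
dev_test_low_avg = averages(talks=10, sentences=1_000, tokens=20_000)
dev_test_high_avg = averages(talks=15, sentences=1_500, tokens=30_000)

for name, (per_talk, per_sentence) in [
    ("train", train_avg),
    ("dev/test (low end)", dev_test_low_avg),
    ("dev/test (high end)", dev_test_high_avg),
]:
    print(f"{name}: {per_talk:.0f} sentences/talk, {per_sentence:.0f} tokens/sentence")
```

On these rounded figures, both the training and the dev/test sets come out at roughly 100 sentences per talk and 20 tokens per sentence, so the quoted sizes are mutually consistent. Actual averages for a given language pair will differ.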