ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
John Wieting, Kevin Gimpel
TL;DR
ParaNMT-50M is a dataset of more than 50 million English paraphrase pairs created by back-translating the Czech side of CzEng. The authors train paraphrastic sentence embeddings on this resource, achieve state-of-the-art correlations on SemEval STS tasks without supervision, and demonstrate paraphrase generation for data augmentation and grammar correction. They show that the choice of data source, filtering, and a mega-batching training regime each have a significant effect on performance. The authors also release the dataset, embeddings, and code to support paraphrase-aware NLP in downstream applications.
Abstract
We describe ParaNMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs. We generated the pairs automatically by using neural machine translation to translate the non-English side of a large parallel corpus, following Wieting et al. (2017). Our hope is that ParaNMT-50M can be a valuable resource for paraphrase generation and can provide a rich source of semantic knowledge to improve downstream natural language understanding tasks. To show its utility, we use ParaNMT-50M to train paraphrastic sentence embeddings that outperform all supervised systems on every SemEval semantic textual similarity competition, in addition to showing how it can be used for paraphrase generation.
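The pair-construction idea described above (translate the non-English side of a parallel corpus and pair the output with the English reference) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `build_paraphrase_pairs`, `toy_translate`, and the lookup table are hypothetical names, the toy translator stands in for a real NMT system, and the "drop identical copies" check is an assumed placeholder for whatever filtering one actually applies.

```python
from typing import Callable, Iterable, List, Tuple


def build_paraphrase_pairs(
    parallel: Iterable[Tuple[str, str]],
    translate_to_english: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each English reference with a machine translation of its
    non-English counterpart (hypothetical helper, for illustration only)."""
    pairs = []
    for foreign, english_ref in parallel:
        reference = english_ref.strip()
        back_translation = translate_to_english(foreign).strip()
        # Assumed filter: skip empty outputs and exact (case-insensitive)
        # copies of the reference, since they carry no paraphrase signal.
        if back_translation and back_translation.lower() != reference.lower():
            pairs.append((reference, back_translation))
    return pairs


# Toy stand-in for an NMT system: a lookup table, NOT a real model.
_TOY_MT = {
    "Děkuji vám.": "Thank you.",
    "Kde je nádraží?": "Where is the train station?",
}


def toy_translate(sentence: str) -> str:
    return _TOY_MT.get(sentence, "")


corpus = [
    ("Děkuji vám.", "Thanks to you."),
    ("Kde je nádraží?", "Where is the train station?"),
]
pairs = build_paraphrase_pairs(corpus, toy_translate)
```

Here the second sentence is dropped because the toy translation matches its reference exactly, leaving one (reference, translation) pair. In practice `translate_to_english` would wrap a trained translation model.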
