Semi-Supervised Learning for Neural Machine Translation

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, Yang Liu

TL;DR

The paper tackles the scarcity of parallel data for neural machine translation by introducing a semi-supervised framework that leverages monolingual corpora through autoencoders built from bidirectional translation models. It combines supervised translation on parallel data with reconstruction terms on monolingual data, using top-$k$ approximations to make training tractable and enable joint optimization of source-to-target and target-to-source models. Empirical results on Chinese–English NIST data show substantial improvements over both SMT and standard NMT baselines, with larger gains when target-language monolingual data is used and with careful control of OOV ratios. The work demonstrates the potential of leveraging monolingual data in both languages to improve translation quality and provides a flexible, architecture-agnostic approach that can extend to other language pairs and NMT systems.

Abstract

While end-to-end neural machine translation (NMT) has made remarkable progress recently, NMT systems only rely on parallel corpora for parameter estimation. Since parallel corpora are usually limited in quantity, quality, and coverage, especially for low-resource languages, it is appealing to exploit monolingual corpora to improve NMT. We propose a semi-supervised approach for training NMT models on the concatenation of labeled (parallel corpora) and unlabeled (monolingual corpora) data. The central idea is to reconstruct the monolingual corpora using an autoencoder, in which the source-to-target and target-to-source translation models serve as the encoder and decoder, respectively. Our approach can not only exploit the monolingual corpora of the target language, but also of the source language. Experiments on the Chinese-English dataset show that our approach achieves significant improvements over state-of-the-art SMT and NMT systems.
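
The reconstruction idea can be illustrated with a small sketch. It assumes the reconstruction probability of a monolingual source sentence is a sum, over latent target translations, of the forward translation probability times the backward translation probability, truncated to the $k$ most probable candidates. The function name, the candidate list, and all probabilities below are hypothetical toy values, not the authors' implementation or data.

```python
import math

def top_k_reconstruction_log_prob(x, candidates, backward_prob, k):
    """Approximate log P(x' = x | x) using only the k most probable latent translations.

    candidates: list of (y, P(y | x)) pairs from a source-to-target model.
    backward_prob: callable returning P(x | y) under a target-to-source model.
    """
    # Keep the k highest-scoring latent target sentences.
    top_k = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]
    # Sum P(y | x) * P(x | y) over the truncated candidate set.
    total = sum(p_forward * backward_prob(x, y) for y, p_forward in top_k)
    return math.log(total)

# Toy numbers, invented purely for illustration.
source = "ni hao shi jie"
forward = [("hello world", 0.6), ("hi world", 0.3), ("hello earth", 0.1)]
backward_table = {"hello world": 0.5, "hi world": 0.4, "hello earth": 0.1}

def backward(x, y):
    # Hypothetical lookup standing in for a target-to-source model score.
    return backward_table[y]

for k in (1, 2, 3):
    approx = math.exp(top_k_reconstruction_log_prob(source, forward, backward, k))
    print(k, round(approx, 4))
```

Because every term in the sum is non-negative, the truncated value is a lower bound on the full marginal and can only grow as $k$ increases, so $k$ trades computation for a tighter approximation.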

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Examples of (a) source autoencoder and (b) target autoencoder on monolingual corpora. Our idea is to leverage autoencoders to exploit monolingual corpora for NMT. In a source autoencoder, the source-to-target model $P(\mathbf{y}|\mathbf{x};\overrightarrow{\bm{\theta}})$ serves as an encoder that transforms the observed source sentence $\mathbf{x}$ into a latent target sentence $\mathbf{y}$ (highlighted in grey), from which the target-to-source model $P(\mathbf{x}'|\mathbf{y}; \overleftarrow{\bm{\theta}})$ reconstructs a copy $\mathbf{x}'$ of the observed source sentence. As a result, monolingual corpora can be combined with parallel corpora to train bidirectional NMT models in a semi-supervised setting.
  • Figure 2: Effect of sample size $k$ on the Chinese-to-English validation set.
  • Figure 3: Effect of sample size $k$ on the English-to-Chinese validation set.
  • Figure 4: Effect of OOV ratio on the Chinese-to-English validation set.
  • Figure 5: Effect of OOV ratio on the English-to-Chinese validation set.
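
As a sketch of what Figure 1(a) describes, the source autoencoder can be written as a marginal over the latent target sentence, with the encoder and decoder probabilities taken from the caption. The candidate set $\mathcal{Y}_k(\mathbf{x})$, denoting the $k$ most probable translations of $\mathbf{x}$, is notation introduced here for illustration and may differ from the paper's own.

$$
P(\mathbf{x}'|\mathbf{x}; \overrightarrow{\bm{\theta}}, \overleftarrow{\bm{\theta}})
= \sum_{\mathbf{y}} P(\mathbf{y}|\mathbf{x}; \overrightarrow{\bm{\theta}}) \, P(\mathbf{x}'|\mathbf{y}; \overleftarrow{\bm{\theta}})
\approx \sum_{\mathbf{y} \in \mathcal{Y}_k(\mathbf{x})} P(\mathbf{y}|\mathbf{x}; \overrightarrow{\bm{\theta}}) \, P(\mathbf{x}'|\mathbf{y}; \overleftarrow{\bm{\theta}})
$$

The target autoencoder in Figure 1(b) would follow the same pattern with the roles of the two models swapped: the target-to-source model encodes an observed target sentence into a latent source sentence, and the source-to-target model reconstructs it.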