Shared Latent Space by Both Languages in Non-Autoregressive Neural Machine Translation

DongNyeong Heo, Heeyoul Choi

TL;DR

This work tackles the quality-speed trade-off in non-autoregressive NMT by introducing LadderNMT, a dual hierarchical latent variable model with a shared intermediate latent space across languages. By employing ladder inference to estimate a posterior over a shared Z without a separate posterior network, the method reduces parameters and mitigates one-sided posterior collapse, while fostering language-agnostic representations. Empirical results on WMT tasks show superior or comparable BLEU scores with substantially fewer parameters, and qualitative analyses (2-D visualizations and CCA) demonstrate that LadderNMT learns more aligned, language-agnostic latent spaces. The approach also yields robust performance gains when applied to both LaNMT and FullyNAT architectures, suggesting strong practical impact for fast and accurate NAT systems and potential extensions to multilingual and cross-modal tasks.

Abstract

Non-autoregressive neural machine translation (NAT) offers a substantial translation speed-up over autoregressive neural machine translation (AT) at the cost of translation quality. Latent variable modeling has emerged as a promising approach to bridging this quality gap, particularly for addressing the chronic multimodality problem in NAT. Previous works that used latent variable modeling added an auxiliary model to estimate the posterior distribution of the latent variable conditioned on the source and target sentences. However, this causes several disadvantages, such as redundant information extraction in the latent variable, an increased number of parameters, and a tendency to ignore some information from the inputs. In this paper, we propose a novel latent variable model that integrates a dual reconstruction perspective and an advanced hierarchical latent modeling with a shared intermediate latent space across languages. This latent variable modeling hypothetically alleviates or prevents the above disadvantages. In our experimental results, we present comprehensive demonstrations that our proposed approach infers superior latent variables, which lead to better translation quality. Finally, on benchmark translation tasks such as WMT, we demonstrate that our proposed method significantly improves translation quality compared to previous NAT baselines, including the state-of-the-art NAT model.
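
To make the idea of inferring a latent variable without a separate posterior network more concrete, the sketch below shows the precision-weighted merge of two diagonal Gaussians that ladder-style inference commonly uses. Whether LadderNMT combines its distributions in exactly this way is an assumption, since this summary does not give the formula; the function name and arguments are hypothetical.

```python
from typing import List, Tuple


def precision_weighted_merge(
    mu_a: List[float], var_a: List[float],
    mu_b: List[float], var_b: List[float],
) -> Tuple[List[float], List[float]]:
    """Merge two diagonal Gaussians by multiplying their densities.

    Illustrative only: this is the generic ladder-style combination,
    assumed here, not taken from the paper's equations.
    """
    mu_out, var_out = [], []
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        prec_a, prec_b = 1.0 / va, 1.0 / vb   # precision = inverse variance
        var = 1.0 / (prec_a + prec_b)         # merged variance
        mu = (ma * prec_a + mb * prec_b) * var  # precision-weighted mean
        mu_out.append(mu)
        var_out.append(var)
    return mu_out, var_out


# Two equally confident estimates meet halfway and the variance halves.
merged_mu, merged_var = precision_weighted_merge([0.0], [1.0], [2.0], [1.0])
```

In this form the more confident estimate (smaller variance) dominates the merged mean, so no additional network is needed to produce the combined distribution.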

Paper Structure (21 sections, 13 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Graphical models of the conventional latent variable model (left) and our proposed model (right). Shaded and white circles represent observed and latent variables, respectively, while solid and dashed lines denote generation and inference processes, respectively. In contrast to prior approaches, we conceptualize the entire translation system as dual reconstructions and, on that basis, propose a dual hierarchical latent variable model in which the intermediate latent variable, $Z$, is inferred from the observation and the deeper latent variable (the translated sentence). Furthermore, we share the intermediate latent variable across the reconstruction processes of both languages.
  • Figure 2: Illustrations of training (left, dual reconstruction task) and testing (right, translation task) with our proposed method, LadderNMT. For testing, we illustrate only the source-to-target translation task; the target-to-source task is performed symmetrically, with the opposite model and input $Y$. The $\mu$ terms represent the respective prior and posterior distributions; their variance parameters $\sigma$ are omitted for simplicity. 'loc. att.' denotes the monotonic location-based attention mechanism that transforms the total length.
  • Figure 3: Latent variables of LadderNMT and LaNMT in 2-dimensional space by t-SNE and UMAP. Red and blue dots are latent variables from English and German sentences, respectively.
  • Figure 4: Relative sensitivity test results. The number in parentheses next to each model label is the KL coefficient.
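
The KL coefficient mentioned in the Figure 4 caption typically scales a KL divergence term between a posterior and a prior in the training objective. As a rough illustration, assuming both distributions are diagonal Gaussians (the paper's exact objective is not reproduced in this summary), the weighted term could be computed as below. The function names and the `kl_coeff` parameter are hypothetical.

```python
import math
from typing import List


def kl_diag_gaussians(
    mu_q: List[float], var_q: List[float],
    mu_p: List[float], var_p: List[float],
) -> float:
    """KL(q || p) for diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


def weighted_kl(kl_coeff: float, kl_value: float) -> float:
    """Scale the KL term by a coefficient (assumed role of the KL coefficient)."""
    return kl_coeff * kl_value


# Unit-variance Gaussians whose means differ by 1 have KL = 0.5.
kl = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
loss_term = weighted_kl(0.5, kl)
```

A smaller coefficient weakens the pull of the posterior toward the prior, which is one reason results are often reported per coefficient value.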