Can the Variation of Model Weights be used as a Criterion for Self-Paced Multilingual NMT?

Àlex R. Atrio, Alexis Allemann, Ljiljana Dolamic, Andrei Popescu-Belis

TL;DR

The paper addresses how to sample minibatches across source languages in many-to-one multilingual NMT (MNMT) under data imbalance. It introduces a self-paced dynamic scheduling method that measures per-task competence with a symmetric KL divergence between Transformer weight distributions, smoothed over time, and uses this signal to trigger language switches, with the aim of allocating more training effort to harder languages. In experiments on 8-to-1 setups with four pairs of low-resource and high-resource languages (LRL-HRL), the approach gives insight into learning dynamics but generally does not surpass the strong baseline of multilingual shuffled batches, although it shows similar convergence behavior and adds little overhead. The benefits of an HRL warmup are limited, and self-paced performance declines as the number of tasks grows. The work contributes a low-cost mechanism for reading model weight dynamics as a curriculum signal and shows that uniform multilingual batching remains a robust default; future work focuses on assembling multilingual batches using per-task weight variation. The per-task competence is defined as $C_c = D'_c(\theta_{j-1}, \theta_{j})$, and the smoothed divergence as $D'_c(\theta_{t-1}, \theta_{t}) = (1-w) D(\theta_{t-1},\theta_{t}) + w D'_c(\theta_{t-k}, \theta_{t-1})$, where $w$ is the exponential smoothing coefficient.
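The smoothing and switching rule summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the softmax normalization that turns weights into a distribution, the `threshold` parameter, and the function names `softmax`, `symmetric_kl`, `smooth`, and `step` are all hypothetical choices made for the example. Only the smoothing formula $(1-w) D + w D'$ and the use of a symmetric KL divergence come from the text.

```python
import math


def softmax(values):
    """Map a flat list of weights to a probability distribution.

    Assumption: the source does not say how weights are normalized;
    softmax is used here only for illustration.
    """
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) of two distributions."""
    return sum(
        (pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q)
    )


def smooth(raw, previous_smoothed, w):
    """Exponential smoothing: (1 - w) * raw + w * previous smoothed value."""
    return (1.0 - w) * raw + w * previous_smoothed


def step(prev_weights, curr_weights, previous_smoothed, w, threshold):
    """Return the new smoothed divergence and whether to switch language.

    Assumption: a switch is triggered when the smoothed divergence falls
    below a hypothetical fixed threshold, i.e. the weights have stopped
    evolving significantly on the current task.
    """
    raw = symmetric_kl(softmax(prev_weights), softmax(curr_weights))
    smoothed = smooth(raw, previous_smoothed, w)
    return smoothed, smoothed < threshold
```

With `w` close to 1 the signal reacts slowly to a single noisy update, while `w` close to 0 follows the raw divergence almost directly, so the choice of `w` trades stability against responsiveness.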

Abstract

Many-to-one neural machine translation systems improve over one-to-one systems when training data is scarce. In this paper, we design and test a novel algorithm for selecting the language of minibatches when training such systems. The algorithm changes the language of the minibatch when the weights of the model do not evolve significantly, as measured by the smoothed KL divergence between all layers of the Transformer network. This algorithm outperforms the use of alternating monolingual batches, but not the use of shuffled batches, in terms of translation quality (measured with BLEU and COMET) and convergence speed.

Paper Structure

This paper contains 17 sections, 2 equations, 3 figures, 6 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Comparison between three different metrics for model weight variation: L2 norm, inverse cosine similarity, and KL divergence. For each of them we compare monitoring the average over all weight matrices and only the final output layer.
  • Figure 2: Evolution of the bidirectional Kullback-Leibler divergence for different values of the exponential smoothing coefficient $w$ in an experiment on Gl-Pt$\rightarrow$En with dynamic sampling.
  • Figure 3: Number of task switches and percentage of training on the LRL (task 1).
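
The three weight-variation metrics compared in Figure 1 can be illustrated with a short Python sketch. It rests on several assumptions that the captions do not settle: "inverse cosine similarity" is read here as $1 - \cos$, weights are turned into distributions with a softmax before the KL divergence is taken, each weight matrix is represented as a flat list of floats, and the layer name `"output"` is hypothetical.

```python
import math


def l2_distance(a, b):
    """L2 norm of the difference between two flattened weight matrices."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inverse_cosine_similarity(a, b):
    """Assumption: 'inverse' is interpreted as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def softmax(values):
    """Assumed normalization from weights to a probability distribution."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_kl(a, b, eps=1e-12):
    """Bidirectional KL divergence between softmax-normalized weights."""
    p, q = softmax(a), softmax(b)
    return sum(
        (pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q)
    )


def average_variation(prev_layers, curr_layers, metric):
    """Average a metric over all weight matrices (dict: name -> flat list)."""
    names = list(curr_layers)
    return sum(metric(prev_layers[n], curr_layers[n]) for n in names) / len(names)


def output_layer_variation(prev_layers, curr_layers, metric, name="output"):
    """Apply a metric to the final output layer only (hypothetical name)."""
    return metric(prev_layers[name], curr_layers[name])
```

Passing the metric as a function keeps the "all matrices" and "output layer only" variants identical apart from which layers they read, which mirrors the two monitoring options the caption describes.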