Can the Variation of Model Weights be used as a Criterion for Self-Paced Multilingual NMT?

Àlex R. Atrio; Alexis Allemann; Ljiljana Dolamic; Andrei Popescu-Belis

TL;DR

The paper addresses how to sample minibatches across source languages in many-to-one multilingual NMT (MNMT) under data imbalance. It introduces a self-paced dynamic scheduling method that uses a symmetric KL divergence between Transformer weight distributions, smoothed over time, to measure per-task competence and trigger language switches, with the aim of allocating more training effort to harder languages. In experiments on 8-to-1 setups with four pairs of low-resource and high-resource languages (LRL-HRL), the approach provides insights into learning dynamics but generally does not surpass the strong baseline of multilingual shuffled batches, although it shows similar convergence behavior and low overhead. The benefits of HRL warmup are limited, and self-paced performance declines as the number of tasks grows. The work contributes a low-cost mechanism for interpreting model weight dynamics as a curriculum signal and shows that uniform multilingual batching remains a robust default; future work will focus on assembling multilingual batches via per-task weight variation. The per-task competence is defined as $C_c = D'_c(\theta_{j-1}, \theta_{j})$, and the smoothed divergence as $D'_c(\theta_{t-1}, \theta_{t}) = (1-w) D(\theta_{t-1},\theta_{t}) + w D'_c(\theta_{t-k}, \theta_{t-1})$, where $w$ is the exponential smoothing coefficient.
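The smoothing step in the formula above can be illustrated with a minimal Python sketch. It assumes that the second term of the formula is simply the previously stored smoothed value for the task; the function name `smooth_divergence` and the handling of the first step are illustrative choices, not taken from the paper.

```python
def smooth_divergence(raw, previous_smoothed, w):
    """Exponentially smooth a raw weight divergence.

    raw: divergence D between the weights before and after the latest update.
    previous_smoothed: last smoothed value D' for this task, or None at the
        first step (assumption: the raw value is used unchanged in that case).
    w: smoothing coefficient in [0, 1]; larger w gives more weight to history.
    """
    if previous_smoothed is None:
        return raw
    return (1.0 - w) * raw + w * previous_smoothed


# Usage: smooth a short, made-up series of raw divergences.
smoothed = None
for raw_value in [1.0, 0.5, 0.25]:
    smoothed = smooth_divergence(raw_value, smoothed, w=0.5)
```

With $w = 0$ the smoothed value follows the raw divergence exactly, and with $w$ close to 1 it changes only slowly, which is the trade-off explored when setting the smoothing weight.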

Abstract

Many-to-one neural machine translation systems improve over one-to-one systems when training data is scarce. In this paper, we design and test a novel algorithm for selecting the language of minibatches when training such systems. The algorithm changes the language of the minibatch when the weights of the model do not evolve significantly, as measured by the smoothed KL divergence between all layers of the Transformer network. This algorithm outperforms the use of alternating monolingual batches, but not the use of shuffled batches, in terms of translation quality (measured with BLEU and COMET) and convergence speed.
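As a rough illustration of the switching criterion described in the abstract, the sketch below computes a symmetric KL divergence per layer, averages it over layers, and switches language when the value falls below a threshold. Several parts are assumptions made for the example rather than details from the paper: weights are turned into distributions with a softmax, the threshold value is hypothetical, and languages are visited in round-robin order.

```python
import math


def softmax(values):
    # Assumption: a softmax turns a flat list of weights into a distribution.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_kl(weights_before, weights_after, eps=1e-12):
    # KL(p || q) + KL(q || p); eps guards against zero probabilities.
    p = softmax(weights_before)
    q = softmax(weights_after)
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return kl_pq + kl_qp


def mean_layer_divergence(layers_before, layers_after):
    # Average the symmetric KL over all layers (each layer is a flat list).
    pairs = list(zip(layers_before, layers_after))
    return sum(symmetric_kl(a, b) for a, b in pairs) / len(pairs)


def pick_language(current, num_languages, divergence, threshold):
    # Hypothetical rule: move to the next language once weights barely change.
    if divergence < threshold:
        return (current + 1) % num_languages
    return current


# Usage with toy two-layer "models" before and after one update.
before = [[0.1, 0.2, 0.3], [0.5, -0.5]]
after = [[0.1, 0.2, 0.3], [0.5, -0.4]]
divergence = mean_layer_divergence(before, after)
language = pick_language(0, num_languages=2, divergence=divergence, threshold=0.01)
```

In practice the divergence passed to the switching rule would be the smoothed value rather than the raw one, so that a single small update does not trigger a switch.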

Paper Structure (17 sections, 2 equations, 3 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Method for Self-Paced MNMT
Formulation and Implementation
Explorations of our Method
Weight Variation Metric
Setting of the Smoothing Weight
Importance of Previous Weight Variation
Training with HRL Warmup Steps
Amount of Task Switches and Balancing
Data and Systems
Corpora
Tokenization
System Architecture
Evaluation
...and 2 more sections

Figures (3)

Figure 1: Comparison between three different metrics for model weight variation: L2 norm, inverse cosine similarity, and KL divergence. For each of them we compare monitoring the average over all weight matrices and only the final output layer.
Figure 2: Evolution of the bidirectional Kullback-Leibler divergence for different values of the exponential smoothing coefficient $w$ in an experiment on Gl-Pt$\rightarrow$En with dynamic sampling.
Figure 3: Amount of task switches and percentage of training on the LRL (task 1).
