Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models

Zirui Wang, Yulia Tsvetkov, Orhan Firat, Yuan Cao

TL;DR

The paper analyzes loss geometry in massively multilingual models by measuring gradient similarity across language-pair tasks, showing that gradient alignment tracks language proximity and correlates with model performance. It identifies a limitation of existing gradient-based multi-task learning (MTL) methods and introduces Gradient Vaccine (GradVac), an adaptive gradient-alignment framework that generalizes PCGrad: instead of a fixed objective, it sets a target gradient similarity for each task pair, tracked with an exponential moving average (EMA) at per-layer granularity. Empirically, GradVac yields significant performance gains on large-scale multilingual neural machine translation (NMT) and on the XTREME benchmark, demonstrating the practical value of geometry-aware optimization and suggesting applicability beyond multilingual scenarios.

Abstract

Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is a common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black-box of multilingual optimization through the lens of loss function geometry. We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with not only language proximity but also the overall model performance. Such observation helps us to identify a critical limitation of existing gradient-based multi-task learning methods, and thus we derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks. Empirically, our method obtains significant model performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. Our work reveals the importance of properly measuring and utilizing language proximity in multilingual optimization, and has broader implications for multi-task learning beyond multilingual modeling.
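
The summary above names three ingredients: cosine similarity between task gradients, a PCGrad-style correction for conflicting gradients, and an EMA-tracked similarity target. The following is a minimal sketch of how they could fit together, not the authors' implementation. The helper names, the toy two-dimensional gradients, and the `beta` value are all hypothetical. The coefficient in `gradvac_align` is one closed form, derived from elementary geometry rather than taken from the text above, that makes the cosine between the adjusted gradient and `g_j` equal the target.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    # Cosine similarity between two flattened gradient vectors.
    return dot(u, v) / (norm(u) * norm(v))

def pcgrad_project(g_i, g_j):
    # PCGrad-style step: act only when the gradients conflict (negative dot
    # product), removing the component of g_i that points against g_j.
    d = dot(g_i, g_j)
    if d >= 0:
        return list(g_i)  # "idle" for non-negative similarity
    scale = d / dot(g_j, g_j)
    return [a - scale * b for a, b in zip(g_i, g_j)]

def gradvac_align(g_i, g_j, target):
    # GradVac-style step (sketch): if the observed cosine is below the target,
    # add a multiple of g_j to g_i so that the new cosine equals the target.
    # Assumes -1 < target < 1 and that g_i, g_j are not exactly antiparallel.
    phi = cosine(g_i, g_j)
    if phi >= target:
        return list(g_i)
    sin_phi = math.sqrt(max(0.0, 1.0 - phi * phi))
    sin_t = math.sqrt(1.0 - target * target)
    coef = norm(g_i) * (target * sin_phi - phi * sin_t) / (norm(g_j) * sin_t)
    return [a + coef * b for a, b in zip(g_i, g_j)]

def ema_update(target, phi, beta=0.01):
    # Exponential moving average of observed similarities; beta is a
    # hypothetical smoothing value chosen for illustration.
    return (1.0 - beta) * target + beta * phi

# Toy usage with hypothetical gradients for two language-pair tasks.
g_a = [1.0, 0.0]
g_b = [1.0, 1.0]                      # positive similarity with g_a
target = 0.9                          # hypothetical current EMA target
adjusted = gradvac_align(g_a, g_b, target)
print(cosine(g_a, g_b), cosine(adjusted, g_b))
target = ema_update(target, cosine(g_a, g_b))
```

With `target = 0`, the coefficient reduces to the PCGrad projection, which is the sense in which the sketch treats PCGrad as a special case. For a pair with positive similarity below the target, `pcgrad_project` returns the gradient unchanged while `gradvac_align` still adjusts it.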

Paper Structure

This paper contains 27 sections, 10 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 2: Comparing gradient similarity versus model performance. (a): Similarity of model gradients between xx-en (left) and en-xx (right) language pairs in a single Any$\rightarrow$Any model. (b): BLEU scores on en-fr of a set of trilingual models versus their gradient similarities. Each model is trained on en-fr and another en-xx language pair.
  • Figure 3: Counts of active PCGrad (left) and GradVac (right) during the training process.
  • Figure 4: Evaluating gradient similarity across model architecture and training steps. (a): Difference between gradient similarities in the encoder and decoder. A positive value (darker) indicates that gradients are more similar in the encoder than in the decoder. (b): Gradient similarities across layers. (c): Gradient similarities of different components and tasks across training steps.
  • Figure 5: Comparing PCGrad (left) with GradVac (right) in two cases. (a): For negative similarity, both methods are effective but GradVac can utilize adaptive objectives between different tasks. (b): For positive similarity, only GradVac is active while PCGrad stays "idle".
  • Figure 6: Comparing multilingual models with bilingual baselines on our dataset. Language pairs are listed in the order of training data sizes (high-resource languages on the left).
  • ...and 6 more figures