Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

Tianjian Li; Haoran Xu; Weiting Tan; Kenton Murray; Daniel Khashabi

TL;DR

The paper addresses heavily imbalanced domain data in multilingual NLP by comparing two balancing strategies: Scalarization (reweighting the loss) and Temperature Sampling (oversampling low-resource domains). It proves that the two are equivalent under full gradient descent but diverge under stochastic optimization because their gradient variance differs: Temperature Sampling has lower variance and converges faster, although a large temperature can cause overfitting. To combine the strengths of both, the authors propose Cooldown, a dynamic temperature schedule that starts with a high temperature to accelerate convergence and gradually lowers it to prevent overfitting. On multilingual machine translation and multilingual language modeling, Cooldown performs better than or competitively with static and dynamic baselines while remaining computationally efficient. These findings offer a principled, practical way to balance data mixtures in heavily imbalanced multilingual settings and can guide future work on temperature schedules for domain-balanced learning.

Abstract

Data abundance across different domains exhibits a long-tailed distribution: few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed towards high-resource languages. Two common strategies to address this disparity are upsampling low-resource data (Temperature Sampling) and upweighting low-resource loss (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation compared to Scalarization, leading to faster convergence but a higher risk of overfitting. Based on these insights, we propose Cooldown, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence and gradually reduces the upsampling to prevent overfitting -- achieving the best of both worlds. Our method competes effectively with existing data re-weighting techniques while offering computational efficiency.
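As a rough illustration of the upsampling and Cooldown ideas described above, the sketch below computes temperature-based sampling probabilities and a decaying temperature. It assumes the common convention $p(i; \tau) \propto |\mathcal{D}_i|^{1/\tau}$, which this summary does not state explicitly. The linear decay, the function names, and the corpus sizes are hypothetical and are not the authors' settings.

```python
def temperature_probs(sizes, tau):
    """Sampling probabilities p(i; tau) proportional to size_i ** (1 / tau).

    Assumed convention (not stated in this summary): tau = 1 is proportional
    sampling, and a larger tau flattens the distribution toward uniform.
    """
    scaled = [s ** (1.0 / tau) for s in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def cooldown_temperature(step, total_steps, tau_start=5.0, tau_end=1.0):
    """Hypothetical linear decay from tau_start to tau_end over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start + frac * (tau_end - tau_start)


# Hypothetical corpus sizes: one high-resource and one low-resource language.
sizes = [1_000_000, 10_000]
for step in (0, 5_000, 10_000):
    tau = cooldown_temperature(step, total_steps=10_000)
    probs = temperature_probs(sizes, tau)
    print(step, round(tau, 2), [round(p, 3) for p in probs])
```

Early in training the low-resource language is drawn far more often than its share of the data would suggest. As the temperature falls toward 1, sampling returns to proportional.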

Paper Structure (40 sections, 4 theorems, 17 equations, 8 figures, 6 tables)

Introduction
Preliminaries
Notations and Task Description
Scalarization (S)
Temperature Sampling (TS)
Temperature Sampling vs. Scalarization
Theoretical Analysis
Implication of Theorem 2
Implication of Theorem 3
Empirical Evidence
Empirical Validation of Theorem 2
Large Temperature Sampling is prone to overfitting
Temperature Sampling is equivalent to Scalarization given enough compute
Cooldown: Balanced Training for Heavily Imbalanced Datasets
Our proposed method:
...and 25 more sections

Key Result

Theorem 1

For any sampling temperature $\tau$, there exists a set of weights $\mathbf{w}_\tau = \{w_1, w_2, ..., w_K\}$ for the Scalarization loss such that this loss is equivalent to the Temperature Sampling loss, both computed based on the whole data $\mathcal{D}$.
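The statement can be checked numerically on toy data. The sketch below is one possible formalization, not the authors' proof: it assumes $p(i; \tau) \propto |\mathcal{D}_i|^{1/\tau}$, treats the Temperature Sampling loss as the $p(i; \tau)$-weighted average of per-domain mean losses, and treats the Scalarization loss as a per-example weighted average over the whole dataset with weights $w_i = p(i; \tau) / p(i; 1)$. The per-example losses are random placeholders.

```python
import random


def temperature_probs(sizes, tau):
    """Assumed convention: p(i; tau) proportional to size_i ** (1 / tau)."""
    scaled = [s ** (1.0 / tau) for s in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


random.seed(0)
# Hypothetical per-example losses for three domains of very different sizes.
domains = [[random.random() for _ in range(n)] for n in (1000, 100, 10)]
sizes = [len(d) for d in domains]
n_total = sum(sizes)
tau = 5.0

p_tau = temperature_probs(sizes, tau)
p_one = temperature_probs(sizes, 1.0)  # proportional sampling

# Temperature Sampling over the whole data: choose domain i with
# probability p(i; tau), then average the loss within that domain.
ts_loss = sum(p * sum(d) / len(d) for p, d in zip(p_tau, domains))

# Scalarization over the whole data: every example of domain i is
# weighted by w_i, and the weighted losses are averaged over all examples.
weights = [pt / p1 for pt, p1 in zip(p_tau, p_one)]
s_loss = sum(w * sum(d) for w, d in zip(weights, domains)) / n_total

print(ts_loss, s_loss)
```

Under these assumptions the two printed values agree up to floating-point error, because $w_i \cdot |\mathcal{D}_i| / |\mathcal{D}| = p(i; \tau)$ for every domain.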

Figures (8)

Figure 1: Validation loss by training iteration of a low-resource language pair (En-Ro) in multilingual machine translation. Proportional sampling leads to underfitting the low-resource direction. Using a high temperature (oversampling LRLs) leads to overfitting the low-resource direction. Employing a high temperature at the beginning and then decreasing the temperature (Cooldown) gets the advantage of fast convergence without overfitting.
Figure 2: Variance of Scalarization $\sum_i \frac{p(i; \tau)^2}{p(i; 1)}$ by sampling temperature $\tau$. A large temperature or a skewed distribution of $\mathcal{D}$ induces a much larger variance for Scalarization. Distributions follow $\mathcal{D}_i \propto \frac{1}{i^\alpha}$. See the appendix with the proof of Theorem 3 for details of the experiment setup.
Figure 3: The distribution of gradient norm between mini-batches on En-{Cs, Ro} for Temperature Sampling and Scalarization. Scalarization induces a larger variance (2.25 $>$ 0.62) between mini-batch gradient norms compared to Temperature Sampling, as indicated by Theorem 3.
Figure 4: Validation loss by training iteration for En-{Cs, Ro} (first row) and En-{Fr, Ro} (second row). Temperature Sampling (dashed) converges faster compared to Scalarization (solid), leading to better performance on both the HRL and the LRL.
Figure 5: Sampling temperature schedules.
...and 3 more figures
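The quantity in the Figure 2 caption, $\sum_i \frac{p(i; \tau)^2}{p(i; 1)}$, is easy to evaluate for power-law domain sizes. The sketch below is a minimal illustration that assumes $p(i; \tau) \propto |\mathcal{D}_i|^{1/\tau}$. The number of domains and the grids of $\alpha$ and $\tau$ are hypothetical and are not the authors' experiment setup. One way to read the quantity, offered here as an interpretation rather than the paper's derivation, is as the second moment of the weights $p(i; \tau) / p(i; 1)$ when domains are drawn proportionally.

```python
def temperature_probs(sizes, tau):
    """Assumed convention: p(i; tau) proportional to size_i ** (1 / tau)."""
    scaled = [s ** (1.0 / tau) for s in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def power_law_sizes(num_domains, alpha):
    """Relative domain sizes proportional to 1 / i ** alpha, i = 1..K."""
    return [1.0 / (i ** alpha) for i in range(1, num_domains + 1)]


def scalarization_variance_factor(sizes, tau):
    """Evaluate sum_i p(i; tau)^2 / p(i; 1) from the Figure 2 caption."""
    p_tau = temperature_probs(sizes, tau)
    p_one = temperature_probs(sizes, 1.0)
    return sum(pt * pt / p1 for pt, p1 in zip(p_tau, p_one))


# Hypothetical grid: 10 domains, two skew levels, three temperatures.
for alpha in (1.0, 2.0):
    sizes = power_law_sizes(10, alpha)
    for tau in (1.0, 2.0, 5.0):
        factor = scalarization_variance_factor(sizes, tau)
        print(f"alpha={alpha} tau={tau} factor={factor:.3f}")
```

At $\tau = 1$ the factor is 1, and in this toy grid it grows with both the temperature and the skew $\alpha$, matching the trend the caption describes.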

Theorems & Definitions (9)

Theorem 1: Equivalency under Gradient Descent
proof
Corollary 1.1
proof
Theorem 2: Scalarization induces larger variance under Stochastic Gradient Descent
Theorem 3: Scalarization induces larger variance when approximating higher temperatures
proof
proof
proof
