Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets
Tianjian Li, Haoran Xu, Weiting Tan, Kenton Murray, Daniel Khashabi
TL;DR
The paper addresses heavily imbalanced domain data in multilingual NLP by comparing two balancing strategies: Scalarization (reweighting the loss of low-resource domains) and Temperature Sampling (oversampling low-resource domains). It proves that the two are equivalent under full gradient descent but diverge under stochastic gradient descent because their gradient estimates differ in variance: Temperature Sampling has lower variance and converges faster, but large temperatures can cause overfitting. To combine the strengths of both, the authors propose Cooldown, a dynamic temperature schedule that starts with a high temperature to accelerate convergence and gradually lowers it to prevent overfitting. Empirical results on multilingual machine translation and multilingual language modeling show that Cooldown matches or outperforms static and dynamic baselines while remaining computationally efficient. These findings offer a principled, practical way to balance data mixtures in heavily imbalanced multilingual settings and can guide future work on temperature schedules for domain-balanced learning.
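The equivalence-in-expectation and variance claims can be illustrated with a small toy calculation. The sketch below is not the paper's proof or experimental setup. It assumes the common convention that Temperature Sampling draws domain i with probability proportional to n_i ** (1 / tau), and that the matching Scalarization weight is the ratio of the temperature-sampled probability to the proportional one. The domain sizes, the scalar "gradients", and tau = 5 are hypothetical values chosen for illustration, and within-domain noise is ignored.

```python
def temperature_probs(sizes, tau):
    """Sampling probability per domain, assuming p_i is proportional to n_i ** (1 / tau)."""
    scaled = [n ** (1.0 / tau) for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def mean_and_variance(probs, estimates):
    """Mean and variance of a one-sample estimator that returns
    estimates[i] with probability probs[i]."""
    mean = sum(p * e for p, e in zip(probs, estimates))
    second_moment = sum(p * e * e for p, e in zip(probs, estimates))
    return mean, second_moment - mean ** 2


# Hypothetical toy values: two domains with 1000 and 10 examples and
# made-up scalar per-domain gradients (within-domain noise is ignored).
sizes = [1000, 10]
grads = [1.0, -0.5]
tau = 5.0

p_prop = temperature_probs(sizes, 1.0)  # proportional sampling (tau = 1)
p_tau = temperature_probs(sizes, tau)   # low-resource domain is upsampled

# Temperature Sampling: draw a domain from p_tau, use its gradient unweighted.
mean_ts, var_ts = mean_and_variance(p_tau, grads)

# Scalarization: draw from p_prop, scale the gradient by w_i = p_tau_i / p_prop_i.
weights = [pt / pp for pt, pp in zip(p_tau, p_prop)]
weighted_grads = [w * g for w, g in zip(weights, grads)]
mean_sc, var_sc = mean_and_variance(p_prop, weighted_grads)

print(f"expected gradient: sampling={mean_ts:.4f}, scalarization={mean_sc:.4f}")
print(f"variance:          sampling={var_ts:.4f}, scalarization={var_sc:.4f}")
```

In this toy setup the two expected gradients coincide, which mirrors the full-gradient equivalence, while the one-sample Scalarization estimator has the larger variance because the rarely drawn low-resource domain carries a large weight.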
Abstract
Data abundance across different domains exhibits a long-tailed distribution: a few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed towards high-resource languages. Two common strategies to address this disparity are upsampling low-resource data (Temperature Sampling) and upweighting low-resource loss (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation than Scalarization, leading to faster convergence but a higher risk of overfitting. Based on these insights, we propose Cooldown, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence and gradually reduces the upsampling to prevent overfitting, achieving the best of both worlds. Our method competes effectively with existing data re-weighting techniques while offering computational efficiency.
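A minimal sketch of what a Cooldown-style schedule could look like follows. The abstract only states that upsampling starts heavy and is gradually reduced, so the linear decay shape, the function name `cooldown_temperature`, and the values `start_tau=5.0`, `end_tau=1.0`, and `decay_steps=10_000` are all assumptions made for illustration, as is the convention that sampling probabilities are proportional to n_i ** (1 / tau).

```python
def temperature_probs(sizes, tau):
    """Sampling probability per domain, assuming p_i is proportional to n_i ** (1 / tau)."""
    scaled = [n ** (1.0 / tau) for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def cooldown_temperature(step, start_tau=5.0, end_tau=1.0, decay_steps=10_000):
    """Hypothetical schedule: anneal tau linearly from start_tau to end_tau
    over decay_steps training steps, then hold it at end_tau."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start_tau + frac * (end_tau - start_tau)


# Hypothetical corpus sizes for a high-, mid-, and low-resource language.
sizes = [1_000_000, 10_000, 100]

for step in (0, 5_000, 10_000):
    tau = cooldown_temperature(step)
    probs = temperature_probs(sizes, tau)
    print(step, round(tau, 2), [round(p, 4) for p in probs])
```

Early steps use a high temperature, so the low-resource language is sampled far more often than its share of the data; as tau falls towards 1 the sampler approaches proportional sampling, which limits how many times the small corpus is repeated.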
