Less is More: Convergence Benefits of Fewer Data Weight Updates over Longer Horizon

Rudrajit Das, Neel Patel, Meisam Razaviyayn, Vahab Mirrokni

Abstract

Data mixing--the strategic reweighting of training domains--is a critical component in training robust machine learning models. This problem is naturally formulated as a bilevel optimization task, where the outer loop optimizes domain weights to minimize validation loss, and the inner loop optimizes model parameters to minimize the weighted training loss. Classical bilevel optimization relies on hypergradients, which theoretically require the inner optimization to reach convergence. However, due to computational constraints, state-of-the-art methods use a finite, often small, number of inner update steps before updating the weights. The theoretical implications of this approximation are not well understood. In this work, we rigorously analyze the convergence behavior of data mixing with a finite number of inner steps $T$. We prove that the "greedy" practical approach of using $T=1$ can fail even in a simple quadratic example. Under a fixed parameter update budget $N$ and assuming the per-domain losses are strongly convex, we show that the optimal $T$ scales as $\Theta(\log N)$ (resp., $\Theta((N \log N)^{1/2})$) for the data mixing problem with access to full (resp., stochastic) gradients. We complement our theoretical results with proof-of-concept experiments.
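
In symbols, the bilevel problem described in the abstract can be sketched as follows. The notation is an assumption made here for illustration: $m$ domains with per-domain training losses $\ell_j$, weights $w = (w^{(1)}, \dots, w^{(m)})$ on the probability simplex $\Delta_m$, and a validation loss $L_{\mathrm{val}}$.

$$
\min_{w \in \Delta_m} \; L_{\mathrm{val}}\big(\theta^\star(w)\big) \quad \text{subject to} \quad \theta^\star(w) \in \arg\min_{\theta} \sum_{j=1}^{m} w^{(j)} \ell_j(\theta).
$$

Finite-horizon methods replace $\theta^\star(w)$ with the iterate reached after $T$ inner steps, which is the approximation the paper analyzes.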

Paper Structure (21 sections, 15 theorems, 166 equations, 2 figures, 2 tables, 3 algorithms)

Key Result

Proposition 4.1

For $t \ge 0$, $\frac{\partial \bm{\theta}_{k, t}}{\partial w_k^{(j)}}$ evolves according to a recursion in $t$ (the display equation is not preserved here), with initial condition $\frac{\partial \bm{\theta}_{k, 0}}{\partial w_k^{(j)}} = \vec{0}$.
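
Proposition 4.1 tracks how the inner iterate depends on the domain weights. As a hedged illustration only, and not the paper's stated recursion, suppose the inner loop takes plain gradient steps $\theta_{t+1} = \theta_t - \eta \sum_i w^{(i)} \nabla \ell_i(\theta_t)$ on scalar quadratic losses $\ell_i(\theta) = \tfrac{1}{2} a_i (\theta - c_i)^2$. Differentiating this update with respect to $w^{(j)}$ gives $s_{t+1}^{(j)} = \big(1 - \eta \sum_i w^{(i)} a_i\big) s_t^{(j)} - \eta\, a_j (\theta_t - c_j)$ with $s_0^{(j)} = 0$, where $s_t^{(j)}$ denotes $\partial \theta_t / \partial w^{(j)}$. The function name, the quadratic losses, and the parameters `a`, `c`, and `eta` below are all assumptions introduced for this sketch.

```python
def unrolled_sensitivity(w, a, c, theta0, eta, T):
    """Run T gradient steps on sum_i w[i] * l_i and track d theta_t / d w[j].

    Assumed (illustrative) per-domain loss: l_i(theta) = 0.5 * a[i] * (theta - c[i])**2.
    The weights w are treated as free variables; no simplex constraint is applied.
    """
    theta = theta0
    s = [0.0] * len(w)  # s[j] = d theta_t / d w[j]; zero at t = 0
    for _ in range(T):
        # Per-domain gradients evaluated at the current iterate theta_t.
        grads = [a_i * (theta - c_i) for a_i, c_i in zip(a, c)]
        # Second derivative of the weighted loss (constant for quadratics).
        curvature = sum(w_i * a_i for w_i, a_i in zip(w, a))
        # Chain rule through one gradient step, using s_t and the gradients at theta_t.
        s = [(1.0 - eta * curvature) * s_j - eta * g_j for s_j, g_j in zip(s, grads)]
        # The parameter update itself.
        theta = theta - eta * sum(w_i * g_i for w_i, g_i in zip(w, grads))
    return theta, s


# Hypothetical two-domain example.
theta_T, sens = unrolled_sensitivity(
    w=[0.3, 0.7], a=[1.0, 2.0], c=[-1.0, 1.5], theta0=0.0, eta=0.1, T=5
)
```

Under these assumptions, the returned sensitivities can be checked against central finite differences of `theta_T` in each weight, which is a cheap sanity test before chaining them into a hypergradient of the validation loss.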

Figures (2)

  • Figure 1: Validation loss and the weight $w_2$ of the second domain (most aligned with validation data) as a function of the horizon $T$, for $N \in \{1000, 5000\}$. Note that the validation loss is lowest and $w_2$ is highest when $T$ is larger than $1$ and sublinear in $N$.
  • Figure 2: Validation loss, validation accuracy, and the weight of the second domain (most aligned with validation data) as a function of the horizon $T$, for $N = 1000$ and $5000$.

Theorems & Definitions (29)

  • Proposition 4.1
  • Theorem 5.1: Informal: Failure of Greedy Approach
  • Theorem 6.4
  • Remark 6.5
  • Theorem 6.7: Stochastic case
  • Remark 6.8
  • Theorem A.1
  • proof
  • Proposition B.1
  • proof
  • ...and 19 more