Less is More: Convergence Benefits of Fewer Data Weight Updates over Longer Horizon

Rudrajit Das; Neel Patel; Meisam Razaviyayn; Vahab Mirrokni

Abstract

Data mixing, the strategic reweighting of training domains, is a critical component in training robust machine learning models. This problem is naturally formulated as a bilevel optimization task, where the outer loop optimizes domain weights to minimize validation loss, and the inner loop optimizes model parameters to minimize the weighted training loss. Classical bilevel optimization relies on hypergradients, which theoretically require the inner optimization to reach convergence. However, due to computational constraints, state-of-the-art methods use a finite, often small, number of inner update steps before updating the weights. The theoretical implications of this approximation are not well understood. In this work, we rigorously analyze the convergence behavior of data mixing with a finite number of inner steps $T$. We prove that the "greedy" practical approach of using $T=1$ can fail even in a simple quadratic example. Under a fixed parameter update budget $N$ and assuming the per-domain losses are strongly convex, we show that the optimal $T$ scales as $\Theta(\log N)$ (resp., $\Theta((N \log N)^{1/2})$) for the data mixing problem with access to full (resp., stochastic) gradients. We complement our theoretical results with proof-of-concept experiments.

Paper Structure (21 sections, 15 theorems, 166 equations, 2 figures, 2 tables, 3 algorithms)

Introduction
Related Work
Notation and Preliminaries
Problem Formulation
Motivating Example and Initial Insights
Main Results: Less is More
Practical Version of Algorithm \ref{alg:1-main} Using Approximate Hessian
Convergence Result for Algorithm \ref{alg:practical}
Extension to the Stochastic Case
Proof Sketch: Bounding the Hypergradient Error
Deterministic setting (\ref{thm:main_convergence}).
Stochastic setting (\ref{thm:stoc-main}).
Analysis without assuming convexity.
Empirical Evaluation
Conclusion
...and 6 more sections

Key Result

Proposition 4.1

For $t \ge 0$, $\frac{\partial \bm{\theta}_{k, t}}{\partial w_k^{(j)}}$ evolves according to a recursion in $t$, with initial condition $\frac{\partial \bm{\theta}_{k, 0}}{\partial w_k^{(j)}} = \vec{0}$.
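To make the statement concrete, here is a minimal sketch of how such a sensitivity can be propagated alongside the parameters. It is not the paper's algorithm. It rests on these assumptions: the inner update is plain gradient descent on the weighted loss, $\theta_{t+1} = \theta_t - \eta \sum_i w^{(i)} \nabla \ell_i(\theta_t)$; the weights are treated as free variables (no simplex or softmax parameterization); and the per-domain losses are toy quadratics $\ell_j(\theta) = \frac{1}{2}(\theta - c_j)^\top A_j (\theta - c_j)$. Under these assumptions the chain rule gives $J^{(j)}_{t+1} = (I - \eta \sum_i w^{(i)} A_i) J^{(j)}_t - \eta \nabla \ell_j(\theta_t)$ with $J^{(j)}_0 = \vec{0}$, where $J^{(j)}_t = \partial \theta_t / \partial w^{(j)}$. The names `make_problem`, `domain_grads`, `unroll`, and `finite_diff` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def make_problem(seed=0, d=3, m=2):
    """Hypothetical toy data: m strongly convex quadratic domain losses in R^d."""
    rng = np.random.default_rng(seed)
    A = np.empty((m, d, d))
    for j in range(m):
        M = rng.standard_normal((d, d))
        A[j] = M @ M.T + np.eye(d)  # symmetric positive definite Hessian
    C = rng.standard_normal((m, d))  # per-domain minimizers c_j
    return A, C


def domain_grads(A, C, theta):
    """Row j is the gradient A_j (theta - c_j) of the j-th quadratic loss."""
    return np.einsum("jab,jb->ja", A, theta - C)


def unroll(A, C, w, theta0, eta, T):
    """Run T gradient steps on the weighted loss and track J[j] = d theta_t / d w_j."""
    m, d = C.shape
    theta = theta0.copy()
    J = np.zeros((m, d))  # initial condition: theta_0 does not depend on w
    H = np.einsum("j,jab->ab", w, A)  # weighted Hessian (constant for quadratics)
    P = np.eye(d) - eta * H
    for _ in range(T):
        g = domain_grads(A, C, theta)  # gradients at the current iterate theta_t
        J = J @ P.T - eta * g  # assumed sensitivity recursion, applied row-wise
        theta = theta - eta * (w @ g)  # assumed inner gradient-descent step
    return theta, J


def finite_diff(A, C, w, theta0, eta, T, j, eps=1e-6):
    """Central-difference estimate of d theta_T / d w_j, used only as a check."""
    wp, wm = w.copy(), w.copy()
    wp[j] += eps
    wm[j] -= eps
    tp, _ = unroll(A, C, wp, theta0, eta, T)
    tm, _ = unroll(A, C, wm, theta0, eta, T)
    return (tp - tm) / (2 * eps)


A, C = make_problem()
w = np.array([0.3, 0.7])
theta0 = np.zeros(3)
theta_T, J = unroll(A, C, w, theta0, eta=0.01, T=5)
fd = np.stack([finite_diff(A, C, w, theta0, 0.01, 5, j) for j in range(2)])
print("max abs difference:", np.abs(J - fd).max())
```

Given these sensitivities, the chain rule yields a $T$-step hypergradient for any differentiable validation loss: its $j$-th entry is the inner product of `J[j]` with the validation gradient at `theta_T`. The finite-difference comparison is only a sanity check on the recursion, not part of the method.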

Figures (2)

Figure 1: Validation loss and the weight of the second domain (most aligned with validation data) $w_2$ as a function of the horizon $T$, for $N \in \{1000, 5000\}$. Note that the validation loss is lowest and $w_2$ is highest when $T$ is larger than $1$ and sublinear in $N$.
Figure 2: Validation loss, validation accuracy, and the weight of the second domain (most aligned with validation data) as a function of the horizon $T$, for $N = 1000$ and $5000$.

Theorems & Definitions (29)

Proposition 4.1
Theorem 5.1: Informal: Failure of Greedy Approach
Theorem 6.4
Remark 6.5
Theorem 6.7: Stochastic case
Remark 6.8
Theorem A.1
proof
Proposition B.1
proof
...and 19 more
