Optimization with Access to Auxiliary Information

El Mahdi Chayti; Sai Praneeth Karimireddy

TL;DR

This work introduces a general framework for stochastic optimization of a target function $f$ whose gradients are expensive, by leveraging an auxiliary function $h$ whose gradients are cheaper. It presents two algorithms, AuxMOM and AuxMVR, which combine biased gradient estimators from $h$ with momentum or variance reduction to accelerate non-convex optimization under a Hessian similarity bound between $f$ and $h$. Theoretical results show convergence improvements over standard SGD when the Hessian similarity parameter $\delta$ is small and the auxiliary noise is well correlated with the noise of the target, with explicit rates and dependencies on problem parameters. Empirical evaluations across toy problems, rotated/mislabeled data, coresets, and semi-supervised logistic regression demonstrate that the proposed methods can robustly exploit auxiliary information to speed up training and improve generalization, particularly in decentralized or data-scarce settings.

Abstract

We investigate the fundamental optimization question of minimizing a target function $f$, whose gradients are expensive to compute or have limited availability, given access to some auxiliary side function $h$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, iv) training with compressed models/dropout, etc. We propose two generic new algorithms that apply in all these settings; we also prove that we can benefit from this framework under the Hessian similarity assumption between the target and side information. A benefit is obtained when this similarity measure is small; we also show a potential benefit from stochasticity when the auxiliary noise is correlated with that of the target function.
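To make the setting concrete, the following is a minimal sketch of why a biased auxiliary gradient needs a correction, not a reproduction of the paper's algorithms. The toy quadratics, the function names (`grad_f`, `grad_h`, `naive_step`, `corrected_step`, `run`), the exact update rule, and all parameter values are assumptions chosen for illustration. Here $f$ and $h$ share the same Hessian but have different minimizers, and a momentum average of the gap between the two gradients is added to every auxiliary step.

```python
import numpy as np

def grad_f(x):
    # Target gradient for f(x) = 0.5 * ||x||^2 (stands in for the "expensive" gradient).
    return x

def grad_h(x, shift):
    # Auxiliary gradient for h(x) = 0.5 * ||x - shift||^2: same Hessian, shifted minimizer.
    return x - shift

def naive_step(x, shift, lr, K):
    # One target-gradient step followed by K-1 uncorrected auxiliary steps.
    x = x - lr * grad_f(x)
    for _ in range(K - 1):
        x = x - lr * grad_h(x, shift)
    return x

def corrected_step(x, m, shift, lr, K, a):
    # Momentum estimate of the gradient gap grad_f - grad_h (assumed update rule).
    m = (1 - a) * m + a * (grad_f(x) - grad_h(x, shift))
    y = x.copy()
    for _ in range(K):
        # Auxiliary gradient plus the momentum correction.
        y = y - lr * (grad_h(y, shift) + m)
    return y, m

def run(T=200, K=10, lr=0.1, a=0.1, shift_value=5.0, dim=3):
    shift = np.full(dim, shift_value)
    x_naive, x_corr, m = np.ones(dim), np.ones(dim), np.zeros(dim)
    for _ in range(T):
        x_naive = naive_step(x_naive, shift, lr, K)
        x_corr, m = corrected_step(x_corr, m, shift, lr, K, a)
    # Distance of each iterate from the minimizer of f, which is the origin.
    return np.linalg.norm(x_naive), np.linalg.norm(x_corr)

if __name__ == "__main__":
    naive_dist, corrected_dist = run()
    print(f"naive: {naive_dist:.3e}, corrected: {corrected_dist:.3e}")
```

In this toy problem the uncorrected iterate settles near the minimizer of $h$ rather than that of $f$, while the corrected iterate approaches the minimizer of $f$ as the momentum term absorbs the constant gradient gap. With differing Hessians the gap is no longer constant, which is where a similarity bound between $f$ and $h$ becomes relevant.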

Paper Structure (34 sections, 12 theorems, 88 equations, 10 figures, 5 algorithms)

Introduction
Motivation.
General Framework
Algorithms and Results
Results
Naive approach
Momentum based approach
MVR based approach
Potential applications
Experiments
Toy example
Leveraging noisy or mislabeled data
Training with Coresets
Semi-supervised logistic regression
Discussion
...and 19 more sections

Key Result

Lemma B.1

$\forall {\bm{a}},{\bm{b}}\in\mathbb{R}^d,\, c>0\: :\:\|{\bm{a}} + {\bm{b}}\|_2^2 \leq (1+c) \|{\bm{a}}\|_2^2 + (1+\dfrac{1}{c}) \|{\bm{b}}\|_2^2\, .$
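The inequality follows from expanding $\|\sqrt{c}\,{\bm{a}} - {\bm{b}}/\sqrt{c}\|_2^2 \geq 0$, which gives $2\langle {\bm{a}}, {\bm{b}}\rangle \leq c\|{\bm{a}}\|_2^2 + \frac{1}{c}\|{\bm{b}}\|_2^2$. As a quick numerical sanity check (a sketch only; the helper name `relaxed_triangle_gap` is hypothetical and not from the paper), the gap between the two sides can be evaluated on random vectors:

```python
import numpy as np

def relaxed_triangle_gap(a, b, c):
    # Right-hand side minus left-hand side of Lemma B.1; should be non-negative for c > 0.
    lhs = np.sum((a + b) ** 2)
    rhs = (1 + c) * np.sum(a ** 2) + (1 + 1 / c) * np.sum(b ** 2)
    return rhs - lhs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.normal(size=5), rng.normal(size=5)
    for c in (0.1, 1.0, 10.0):
        print(c, relaxed_triangle_gap(a, b, c))
```

The gap equals $\|\sqrt{c}\,{\bm{a}} - {\bm{b}}/\sqrt{c}\|_2^2$, so it vanishes exactly when ${\bm{b}} = c\,{\bm{a}}$.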

Figures (10)

Figure 1: Effect of the bias $\zeta$ (zeta in the figure) on the naive approach (Naive), AuxMOM and Fine Tuning (FT) for $K=10$, $\delta=1$ and $\eta=\min(1/2, 1/(\delta K))$. We can see that the naive approach fails to converge for large bias values, whereas AuxMOM converges all the time, no matter the value of the bias. Fine Tuning converges much slower for small values of $\delta$, but beats AuxMOM for $\delta=10$.
Figure 2: Effect of the similarity $\delta$ (delta in the figure) on the naive approach (Naive), AuxMOM and Fine Tuning for $K=10$, $\zeta=10$, with $\eta=0.5/(1+\delta)$ for Naive and AuxMOM, and $\eta = 0.5/(1+\delta)$ then $\eta = 0.5$ for Fine Tuning. We can see that the naive approach fails to benefit from small values of $\delta$; AuxMOM does not suffer from the same problem, whereas Fine Tuning is slower than AuxMOM.
Figure 3: Effect of $K$ ($K-1$ is the number of times we use the helper $h$) on the test accuracy of the main task (for an angle of 45). We can see that our approach, as our theory predicts, benefits from bigger values of $K$.
Figure 4: Test accuracy obtained using different angles as helpers, for $K=10$, step size $\eta=0.01$ and momentum parameter $a=0.1$. We see that, astonishingly, AuxMOM does not suffer much from the change in the angle, whereas, as expected, the bigger the angle, the worse the accuracy on the main task for the naive approach.
Figure 5: Comparison of the naive approach, AuxMOM, and Fine Tuning for an angle of 90. Again, we see that while not suffering from the added bias, Fine Tuning is slower than AuxMOM.
...and 5 more figures

Theorems & Definitions (23)

Lemma B.1
proof
Lemma B.2
Lemma B.3
proof
Lemma C.1
proof
Lemma C.2
proof
Lemma C.3
...and 13 more
