Optimization with Access to Auxiliary Information

El Mahdi Chayti, Sai Praneeth Karimireddy

TL;DR

This work introduces a general framework for stochastic optimization of a target function $f$ whose gradients are expensive, by leveraging an auxiliary function $h$ whose gradients are cheaper. It presents two algorithms, AuxMOM and AuxMVR, which combine biased gradient estimators from $h$ with momentum or variance reduction to accelerate non-convex optimization under a Hessian similarity bound between $f$ and $h$. Theoretical results show convergence improvements over standard SGD when the Hessian similarity $\delta$ is small and the auxiliary noise is well correlated with the target noise, with explicit rates and dependencies on problem parameters. Empirical evaluations across toy problems, rotated/mislabeled data, coresets, and semi-supervised logistic regression demonstrate that the proposed methods can robustly exploit auxiliary information to speed up training and improve generalization, particularly in decentralized or data-scarce settings.

Abstract

We investigate the fundamental optimization question of minimizing a target function $f$, whose gradients are expensive to compute or have limited availability, given access to some auxiliary side function $h$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, iv) training with compressed models/dropout, etc. We propose two generic new algorithms that apply in all these settings; we also prove that we can benefit from this framework under the Hessian similarity assumption between the target and side information. A benefit is obtained when this similarity measure is small; we also show a potential benefit from stochasticity when the auxiliary noise is correlated with that of the target function.
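To make the setting concrete, here is a minimal Python sketch on a one-dimensional quadratic. The update rules are not spelled out in this summary, so everything below is an assumption made for illustration: the toy objectives `grad_f` and `make_grad_h`, the "naive" scheme (take $K-1$ helper steps, then one target step), and the momentum-corrected scheme (add a running average of the gradient difference $\nabla f - \nabla h$ to each helper gradient). Gradients are exact here, so the stochastic aspects of AuxMOM and AuxMVR are not modeled.

```python
"""Toy comparison: naive use of a biased helper vs. a momentum-corrected update."""


def grad_f(x):
    # Assumed target f(x) = 0.5 * x**2, minimised at x = 0.
    return x


def make_grad_h(delta, zeta):
    # Assumed helper h(x) = 0.5 * (1 + delta) * x**2 + zeta * x.
    # |h'' - f''| = delta (Hessian dissimilarity); zeta biases the gradient.
    def grad_h(x):
        return (1.0 + delta) * x + zeta
    return grad_h


def naive(x, grad_h, eta, K, rounds):
    # Each round: K - 1 plain steps on the helper, then one step on the target.
    for _ in range(rounds):
        for _ in range(K - 1):
            x -= eta * grad_h(x)
        x -= eta * grad_f(x)
    return x


def momentum_corrected(x, grad_h, eta, K, rounds, a):
    # m is a running average of grad_f - grad_h, refreshed once per round.
    m = grad_f(x) - grad_h(x)
    for _ in range(rounds):
        m = (1.0 - a) * m + a * (grad_f(x) - grad_h(x))  # one target gradient
        for _ in range(K - 1):
            x -= eta * (grad_h(x) + m)  # helper gradient plus correction
    return x


if __name__ == "__main__":
    delta, zeta, K = 1.0, 10.0, 10
    eta = 0.5 / (1.0 + delta)
    grad_h = make_grad_h(delta, zeta)
    print("naive    :", naive(5.0, grad_h, eta, K, rounds=400))
    print("corrected:", momentum_corrected(5.0, grad_h, eta, K, rounds=400, a=0.1))
```

On this toy problem the naive iterate settles at a point pulled toward the helper's minimizer, with an error that scales with `zeta`, while the corrected iterate approaches the target minimizer at 0. This is only a qualitative illustration of why correcting the helper's bias matters, not a reproduction of the paper's experiments.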

Paper Structure (34 sections, 12 theorems, 88 equations, 10 figures, 5 algorithms)


Key Result

Lemma B.1

$\forall {\bm{a}},{\bm{b}}\in\mathbb{R}^d,\, c>0\: :\:\|{\bm{a}} + {\bm{b}}\|_2^2 \leq (1+c) \|{\bm{a}}\|_2^2 + (1+\dfrac{1}{c}) \|{\bm{b}}\|_2^2\, .$
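One standard way to see why Lemma B.1 holds is sketched below. The paper's own proof is not reproduced in this summary, so this route is an assumption about the argument, though each step is elementary: expand the square, then bound the cross term using the non-negativity of a rescaled difference.

```latex
\begin{align*}
\|{\bm{a}} + {\bm{b}}\|_2^2
  &= \|{\bm{a}}\|_2^2 + 2\langle {\bm{a}}, {\bm{b}} \rangle + \|{\bm{b}}\|_2^2, \\
0 \le \Big\| \sqrt{c}\,{\bm{a}} - \tfrac{1}{\sqrt{c}}\,{\bm{b}} \Big\|_2^2
  &= c\,\|{\bm{a}}\|_2^2 - 2\langle {\bm{a}}, {\bm{b}} \rangle + \tfrac{1}{c}\,\|{\bm{b}}\|_2^2
  \;\Longrightarrow\;
  2\langle {\bm{a}}, {\bm{b}} \rangle \le c\,\|{\bm{a}}\|_2^2 + \tfrac{1}{c}\,\|{\bm{b}}\|_2^2, \\
\|{\bm{a}} + {\bm{b}}\|_2^2
  &\le (1+c)\,\|{\bm{a}}\|_2^2 + \Big(1+\tfrac{1}{c}\Big)\,\|{\bm{b}}\|_2^2 .
\end{align*}
```

The free parameter $c$ trades off the weight placed on the two terms: a small $c$ keeps the coefficient of $\|{\bm{a}}\|_2^2$ close to 1 at the cost of a large coefficient on $\|{\bm{b}}\|_2^2$.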

Figures (10)

  • Figure 1: Effect of the bias $\zeta$ (zeta in the figure) on the naive approach (Naive), AuxMOM, and Fine Tuning (FT) for $K=10$, $\delta=1$, and $\eta=\min(1/2, 1/(\delta K))$. The naive approach fails to converge for large bias values, whereas AuxMOM converges regardless of the bias. Fine Tuning converges much more slowly for small values of $\delta$, but beats AuxMOM for $\delta=10$.
  • Figure 2: Effect of the similarity $\delta$ (delta in the figure) on the naive approach (Naive), AuxMOM, and Fine Tuning for $K=10$, $\zeta=10$, with $\eta=0.5/(1+\delta)$ for Naive and AuxMOM, and $\eta = 0.5/(1+\delta)$ followed by $\eta = 0.5$ for Fine Tuning. The naive approach fails to benefit from small values of $\delta$; AuxMOM does not suffer from this problem, whereas Fine Tuning is slower than AuxMOM.
  • Figure 3: Effect of $K$ ($K-1$ is the number of times we use the helper $h$) on the test accuracy of the main task (for an angle = 45). As our theory predicts, our approach benefits from larger values of $K$.
  • Figure 4: Test accuracy obtained using different angles as helpers, for $K=10$, step size $\eta=0.01$, and momentum parameter $a=0.1$. Remarkably, AuxMOM does not suffer much from the change in angle, whereas, as expected, the naive approach's accuracy on the main task degrades as the angle grows.
  • Figure 5: Comparison of the naive approach, AuxMOM, and Fine Tuning for an angle = 90. Again, Fine Tuning does not suffer from the added bias but is slower than AuxMOM.
  • ...and 5 more figures

Theorems & Definitions (23)

  • Lemma B.1
  • proof
  • Lemma B.2
  • Lemma B.3
  • proof
  • Lemma C.1
  • proof
  • Lemma C.2
  • proof
  • Lemma C.3
  • ...and 13 more