Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients

Sachin Garg; Albert S. Berahas; Michał Dereziński

TL;DR

This work addresses large-scale finite-sum convex optimization by integrating partial second-order information into a variance-reduced stochastic method. The Mb-SVRN algorithm combines SVRG-style gradient estimates with an $\alpha$-approximate Hessian oracle, proving high-probability linear convergence that is robust to gradient mini-batch size up to $b_{\max}=O\bigl(n/(\alpha\log n)\bigr)$. The key contribution is a martingale-based analysis that yields a rapid convergence rate $\rho\lesssim \alpha^2\kappa/n$ in the robust regime, plus a clear phase transition to Newton-type behavior beyond $b_{\max}$. Empirically, Mb-SVRN outperforms purely first-order methods in data-pass efficiency and remains stable across a wide range of $b$, $h$, and step sizes, demonstrating practical scalability for very large datasets while maintaining resilience to Hessian approximation quality.

Abstract

We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches. We demonstrate this phenomenon on a prototypical stochastic second-order algorithm, called Mini-Batch Stochastic Variance-Reduced Newton ($\texttt{Mb-SVRN}$), which combines variance-reduced gradient estimates with access to an approximate Hessian oracle. In particular, we show that when the data size $n$ is sufficiently large, i.e., $n\gg \alpha^2\kappa$, where $\kappa$ is the condition number and $\alpha$ is the Hessian approximation factor, then $\texttt{Mb-SVRN}$ achieves a fast linear convergence rate that is independent of the gradient mini-batch size $b$, as long as $b$ is in the range between $1$ and $b_{\max}=O(n/(\alpha\log n))$. Only after the mini-batch size increases past this critical point $b_{\max}$ does the method begin to transition into a standard Newton-type algorithm, which is much more sensitive to the Hessian approximation quality. We demonstrate this phenomenon empirically on benchmark optimization tasks, showing that, after tuning the step size, the convergence rate of $\texttt{Mb-SVRN}$ remains fast for a wide range of mini-batch sizes, and that the dependence of the phase transition point $b_{\max}$ on the Hessian approximation factor $\alpha$ aligns with our theoretical predictions.
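The combination described in the abstract (an SVRG-style variance-reduced gradient estimate, preconditioned by an approximate Hessian) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective is taken to be ridge-regularized least squares, the Hessian oracle is replaced by a subsampled Hessian built from `h` random rows, and the function name `mb_svrn`, its default parameters, and the fixed step size `eta` are all hypothetical choices made for the example.

```python
import numpy as np


def mb_svrn(A, y, lam=1e-3, b=16, h=128, eta=0.3, outer_iters=10, seed=0):
    """Illustrative Mb-SVRN-style loop for ridge-regularized least squares.

    Minimizes f(x) = (1/n) * sum_i 0.5 * (a_i @ x - y_i)**2 + 0.5 * lam * ||x||^2.
    All defaults are assumptions for this sketch, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    t_max = max(n // b, 1)  # inner-loop length t_max = n / b

    def grad(x, idx):
        # Average gradient of the component functions indexed by idx.
        A_idx = A[idx]
        return A_idx.T @ (A_idx @ x - y[idx]) / len(idx) + lam * x

    x_ref = np.zeros(d)
    for _ in range(outer_iters):
        # Full gradient at the reference point, computed once per outer iteration.
        g_ref = grad(x_ref, np.arange(n))
        # Stand-in for the Hessian oracle: subsampled Hessian from h random rows.
        rows = rng.choice(n, size=h, replace=False)
        H = A[rows].T @ A[rows] / h + lam * np.eye(d)

        x = x_ref.copy()
        for _ in range(t_max):
            batch = rng.choice(n, size=b, replace=False)
            # SVRG-style variance-reduced gradient estimate on the mini-batch.
            g_hat = grad(x, batch) - grad(x_ref, batch) + g_ref
            # Newton-type step preconditioned by the approximate Hessian.
            x = x - eta * np.linalg.solve(H, g_hat)
        x_ref = x
    return x_ref


# Small synthetic demo (data sizes chosen arbitrarily for illustration).
demo_rng = np.random.default_rng(1)
A_demo = demo_rng.standard_normal((1024, 5))
y_demo = A_demo @ demo_rng.standard_normal(5) + 0.1 * demo_rng.standard_normal(1024)
x_hat = mb_svrn(A_demo, y_demo)
x_star = np.linalg.solve(A_demo.T @ A_demo / 1024 + 1e-3 * np.eye(5),
                         A_demo.T @ y_demo / 1024)
print("relative error:", np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star))
```

Setting `b = n` makes the gradient estimate exact, so each inner step becomes a subsampled Newton step, while replacing `H` with the identity recovers plain mini-batch SVRG; these are the two extremes the abstract contrasts.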

Paper Structure (29 sections, 20 theorems, 164 equations, 6 figures, 1 algorithm)

Introduction
Outline.
Main Convergence Result
Main algorithm and result
Discussion
Implications for stochastic gradient methods.
Implications for Newton-type methods.
Related Work
Convergence Analysis
Notation.
Auxiliary lemmas
One step expectation result
Building blocks for submartingale framework
Martingale setup
High probability convergence via martingale framework
...and 14 more sections

Key Result

Theorem 1

Suppose that the stated assumption holds and $n\gtrsim \alpha^2\kappa$. Then, in a local neighborhood around $\mathbf x^*$, Algorithm 1 using Hessian $\alpha$-approximations and any gradient mini-batch size $b\lesssim \frac{n}{\alpha\log n}$, with $t_{\max}=n/b$ and optimally chosen $\eta$, satisf...
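To get a feel for the scales involved, the quantities $n/(\alpha\log n)$ and $t_{\max}=n/b$ can be evaluated numerically. The sketch below is only illustrative: the constant hidden by $\lesssim$ is not given here, so the parameter `c` (set to 1) is an assumption, as are the helper names and the use of the natural logarithm. Only the scaling in $n$ and $\alpha$ should be read from it.

```python
import math


def robust_batch_limit(n, alpha, c=1.0):
    """Evaluate c * n / (alpha * log n), the scale of b_max.

    The constant c is hypothetical; the source only states the order of growth.
    """
    return c * n / (alpha * math.log(n))


def inner_loop_length(n, b):
    """Inner-loop length t_max = n / b, rounded down to at least one step."""
    return max(n // b, 1)


n = 10**6
for alpha in (1.0, 2.0, 4.0):
    # A worse Hessian approximation (larger alpha) shrinks the robust range of b.
    print(alpha, round(robust_batch_limit(n, alpha)))
print("t_max for b = 1000:", inner_loop_length(n, 1000))
```

Doubling $\alpha$ halves the evaluated limit, which matches the picture of the robust regime narrowing as the Hessian estimate degrades.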

Figures (6)

Figure 1: Convergence rate of Mb-SVRN (smaller is better, see the Experiments section), as we vary gradient mini-batch size $b$ and Hessian sample size $h$, including the extreme cases of SN ($b\!=\!n$) and SVRG ($h\!=\!0$). The plot shows that after adding some second-order information (increasing $h$), the convergence rate of Mb-SVRN quickly becomes robust to gradient mini-batch size. On the other hand, the performance of SVRG rapidly degrades as we increase the gradient mini-batch size $b$, which ultimately turns it into simple gradient descent (GD).
Figure 2: Illustration of the Mb-SVRN convergence analysis from Theorem 1 (for $n\gtrsim \alpha^2\kappa$), showing how the regime of robustness to gradient mini-batch size $b$ depends on the quality of the Hessian oracle (smaller $\alpha$ means better Hessian estimate). As we increase $b$ past $\frac{n}{\alpha\log n}$, the algorithm gradually transitions to a full-gradient Newton-type method.
Figure 3: Visual illustration of two problematic scenarios disrupting the submartingale behavior of $\left\lVert\Delta_t\right\rVert_{\mathbf{H}}^2$. The green line denotes the convergence guarantee we prove in our work (the formal version of Theorem 1). In the left plot, the red dot ($\bullet$) represents the first stopping time obtained due to failure of the local neighborhood condition, plotted via the brown line near the top. In the right plot, the red dot ($\bullet$) represents the first stopping time obtained due to the iterate $\mathbf x_t$ lying too close to $\mathbf x^*$, plotted via the blue line near the bottom. The teal dots ($\bullet$) denote the resume time, representing the restoration of the submartingale property.
Figure 4: Experiments on EMNIST, CIFAR10 and the synthetic dataset, as we vary gradient mini-batch size $b$ and Hessian sample size $h$, showing the robustness of Mb-SVRN to gradient mini-batch size and phase transition into standard Newton's method for large mini-batches.
Figure 5: Experiments with logistic regression on the EMNIST and CIFAR10 datasets. The red dots on every curve mark the respective optimal convergence rate attained at the optimal step size. The bottom six plots demonstrate the performance of Mb-SVRN with different Hessian sample sizes $h$. The top two plots demonstrate SVRG's performance.
...and 1 more figure

Theorems & Definitions (28)

Definition 1: Gradient oracle
Definition 2: Hessian oracle
Theorem 1: Main result (informal version of the final result theorem)
Remark 1
Definition 3
Lemma 1: Upper bound on variance of stochastic gradient
Lemma 2: Guaranteed error reduction for approximate Newton
Lemma 3: High-probability bound on stochastic gradient noise
Lemma 4: Mean Value Theorem
Theorem 2: One inner iteration conditional expectation result
...and 18 more
