On-Average Stability of Multipass Preconditioned SGD and Effective Dimension

Simon Vary; Tyler Farghly; Ilja Kuzborskij; Patrick Rebeschini

Abstract

We study how the population risk curvature, the geometry of the noise, and preconditioning trade off in determining the generalisation ability of multipass Preconditioned Stochastic Gradient Descent (PSGD). Many practical optimisation heuristics implicitly navigate this trade-off in different ways: for instance, some aim to whiten gradient noise, while others aim to align updates with the expected loss curvature. When the geometry of the population risk curvature and the geometry of the gradient noise do not match, an aggressive choice that improves one aspect can amplify instability along the other, leading to suboptimal statistical behaviour. In this paper we employ on-average algorithmic stability to connect the generalisation of PSGD to an effective dimension that depends on these sources of curvature. Existing techniques for on-average stability of SGD are limited to a single pass; as our first contribution, we develop a new on-average stability analysis for multipass SGD that handles the correlations induced by data reuse. This allows us to derive excess risk bounds that depend on the effective dimension. In particular, we show that an improperly chosen preconditioner can yield suboptimal effective-dimension dependence in both optimisation and generalisation. Finally, we complement our upper bounds with matching, instance-dependent lower bounds.
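To make the object of study concrete, the following is a minimal sketch of a multipass PSGD loop, assuming the usual update $x_{t+1} = x_t - \eta P \nabla \ell(x_t, z_i)$ with a fixed diagonal preconditioner $P$. The squared loss, the noiseless synthetic data, the reshuffling at each pass, and the names `make_data`, `psgd`, and `error` are illustrative assumptions and are not taken from the paper.

```python
import random


def make_data(n, x_true, scales, seed=1):
    """Noiseless linear data b = <a, x_true>; feature j is scaled by scales[j]."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a = [s * rng.gauss(0.0, 1.0) for s in scales]
        b = sum(aj * xj for aj, xj in zip(a, x_true))
        data.append((a, b))
    return data


def psgd(data, precond_diag, step_size, num_passes, seed=0):
    """Multipass PSGD on the loss 0.5 * (<a, x> - b)^2 with diagonal P.

    Assumption: samples are reshuffled at every pass, so each data point
    is reused num_passes times.
    """
    rng = random.Random(seed)
    x = [0.0] * len(precond_diag)
    order = list(range(len(data)))
    for _ in range(num_passes):
        rng.shuffle(order)
        for i in order:
            a, b = data[i]
            residual = sum(aj * xj for aj, xj in zip(a, x)) - b
            # Gradient is residual * a; the step is x <- x - eta * P * grad.
            x = [xj - step_size * pj * residual * aj
                 for xj, pj, aj in zip(x, precond_diag, a)]
    return x


def error(x, x_true):
    """Largest coordinate-wise deviation from the target."""
    return max(abs(u - v) for u, v in zip(x, x_true))


x_true = [1.0, -2.0]
scales = [1.0, 0.1]  # second direction has much lower curvature
data = make_data(200, x_true, scales)
x_plain = psgd(data, [1.0, 1.0], step_size=0.05, num_passes=5)
x_pre = psgd(data, [1.0 / s ** 2 for s in scales], step_size=0.05, num_passes=5)
print(error(x_plain, x_true), error(x_pre, x_true))
```

In this noiseless toy problem the curvature-matched preconditioner only speeds up optimisation along the flat direction; it does not exhibit the generalisation trade-off analysed in the paper, which arises when gradient noise and curvature have different geometries.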

Paper Structure

This paper contains 42 sections, 20 theorems, 154 equations, and 1 figure.

INTRODUCTION
Our contributions
Smooth strongly convex losses.
On-average stability for smooth PL losses.
Lower bounds.
Notation and terminology.
Proof Sketch and Technical Challenges
Generalisation Geometry via On-Average Multipass Stability with Correlated Iterates
Spectral Alignment under Geometric Mismatch
Preliminaries
Relative smoothness & strong convexity.
Generalised co-coercivity.
Excess risk bounds of PSGD via on-average stability
On-average stability and risk bounds for strongly convex smooth losses
Risk bounds for non-convex losses under PL-property
...and 27 more sections

Key Result

lemma 1

Let $f$ be $\alpha$-strongly convex and $\beta$-smooth w.r.t. $\| \cdot \|_H$, and let $P$ be $C_{\ell, P}$-spectrally aligned with $\ell(\cdot,z)$, i.e., $\kappa(PH) < \rho_\ell^2$ as in the definition of spectral alignment. Then for all $x, y \in \mathbb{R}^d$:

Figures (1)

Figure 1: Illustration of model misspecification. The geometry of the expected loss curvature $\nabla^2 f$ differs from the geometry of the gradient noise ($\Sigma$). While setting $P \approx \Sigma^{-1}$ whitens the noise, it may result in unstable updates along high-curvature directions.
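The instability described in the caption can be illustrated with a small deterministic check. This is a sketch under simplifying assumptions that are not from the paper: the curvature $H$ and the noise covariance $\Sigma$ are both diagonal, the loss is quadratic, and only the mean update $x \mapsto (I - \eta P H)x$ is considered. The numerical values and the function name `worst_contraction` are hypothetical.

```python
def worst_contraction(curvature, precond, step_size):
    """Largest |1 - eta * p_i * h_i| over coordinates.

    For a diagonal quadratic, the mean preconditioned update multiplies
    coordinate i by (1 - eta * p_i * h_i); a value above 1 means the
    iterates diverge along that coordinate.
    """
    return max(abs(1.0 - step_size * p * h) for p, h in zip(precond, curvature))


curvature = [10.0, 1.0]   # hypothetical diagonal of the loss Hessian
noise_var = [0.1, 1.0]    # hypothetical diagonal of Sigma: low noise where curvature is high
step_size = 0.1

whitening = [1.0 / s for s in noise_var]   # P = Sigma^{-1}
aligned = [1.0 / h for h in curvature]     # P = H^{-1}

print(worst_contraction(curvature, whitening, step_size))
print(worst_contraction(curvature, aligned, step_size))
```

With these values the noise-whitening choice scales the high-curvature coordinate by a factor of magnitude greater than one at this step size, while the curvature-aligned choice contracts every coordinate. Restoring stability under whitening would need a much smaller step size, which slows progress along the remaining direction.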

Theorems & Definitions (41)

definition 1: Smoothness w.r.t. $\|\cdot\|_H$
definition 2: Strong convexity w.r.t. $\|\cdot\|_H$
definition 3: Spectrally aligned preconditioner
lemma 1: Co-coercivity of spectrally aligned PSGD updates
lemma 2
lemma 3: On-average parameter stability of PSGD
proposition 1: Risk bounds in geometry defined by $P^{-1}$
remark 1: Approximate NGD under misspecification
proposition 2: Risk bounds in geometry defined by $H$
proposition 3: Excess risk bounds for PL-losses
...and 31 more
