Characterizing Implicit Bias in Terms of Optimization Geometry

Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro

TL;DR

The paper investigates how optimization geometry shapes the implicit bias of common algorithms when fitting underdetermined linear models and separable classifiers. It distinguishes losses with a unique finite root from strictly monotone losses and derives, where possible, parameter- and hyperparameter-independent characterizations: MD converges to a minimum-divergence solution; NGD reduces to MD in the infinitesimal-step limit; steepest descent yields max-margin directions within the unit ball for monotone losses; and matrix-factorization introduces a nuclear-norm bias under monotone losses. It also reveals that AdaGrad can retain initialization-dependent bias even for monotone losses, and provides detailed proofs and extensions in appendices. Overall, the work clarifies when and how optimization geometry dictates the global minima reached by popular algorithms, offering a principled lens for understanding inductive biases in linear and simple structured models. This sets the stage for extending these ideas to more complex models and nonconvex settings.

Abstract

We study the implicit bias of generic optimization methods, such as mirror descent, natural gradient descent, and steepest descent with respect to different potentials and norms, when optimizing underdetermined linear regression or separable linear classification problems. We explore the question of whether the specific global minimum (among the many possible global minima) reached by an algorithm can be characterized in terms of the potential or norm of the optimization geometry, and independently of hyperparameter choices such as step-size and momentum.
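The step-size independence described above can be illustrated in the simplest geometry. The following is a minimal sketch, assuming plain gradient descent on a squared loss (mirror descent with the potential $\psi(w)=\frac{1}{2}\|w\|_2^2$); the data `X`, `y`, the initialization `w0`, and the helper `gradient_descent` are hypothetical and not taken from the paper. For this special case the limit should be the global minimum closest to `w0` in Euclidean distance, whichever stable step-size is used.

```python
import numpy as np

# Hypothetical underdetermined problem: 2 equations, 4 unknowns.
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 2.0]])
y = np.array([1.0, 2.0])
w0 = np.array([1.0, -1.0, 0.0, 0.5])  # arbitrary initialization


def gradient_descent(X, y, w0, eta, steps=2000):
    """Gradient descent on L(w) = 0.5 * ||X w - y||^2 starting from w0."""
    w = w0.copy()
    for _ in range(steps):
        w = w - eta * X.T @ (X @ w - y)
    return w


# Closed form for argmin ||w - w0||_2 subject to X w = y.
w_star = w0 + np.linalg.pinv(X) @ (y - X @ w0)

# Two different step-sizes, both below 2 / lambda_max(X^T X) = 0.4.
w_a = gradient_descent(X, y, w0, eta=0.2)
w_b = gradient_descent(X, y, w0, eta=0.05)

print("closest global minimum:", w_star)
print("GD, eta = 0.2:         ", w_a)
print("GD, eta = 0.05:        ", w_b)
```

Every update is a linear combination of the rows of `X`, so the component of `w0` orthogonal to those rows is never changed; this is why the limit depends on the initialization but not on the step-size.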

Paper Structure

This paper contains 34 sections, 25 theorems, 81 equations, 1 figure.

Key Result

Theorem 1

For any loss $\ell$ with a unique finite root (Property ass:finite-root), any realizable dataset $\{x_{n},y_{n}\}_{n=1}^N$, and any strongly convex potential $\psi$, consider the mirror descent iterates ${w_{(t)}}$ from eq. eq:md-upd-opt for minimizing the empirical loss $\mathcal{L}(w)$ in eq. eq:l

Figures (1)

  • Figure 1: Dependence of implicit bias on step-size and momentum. In $(a)$--$(c)$, the blue line denotes the set $\mathcal{G}$ of global minima for the respective examples. In $(a)$ and $(b)$, $\psi$ is the entropy potential and all algorithms are initialized with $w_{(0)}=[1,1]$, so that $w_{(0)}=\operatorname*{arg\,min}_{w}\psi(w)$. $w^*_\psi=\operatorname*{arg\,min}_{w\in\mathcal{G}}\psi(w)$ denotes the minimum-potential global minimum we expect to converge to. $(a)$ Mirror descent with primal momentum (Example \ref{ex:md}): the global minimum that eq. \ref{eq:primal-mom} converges to depends on the momentum parameters; the sub-plots contain the trajectories of eq. \ref{eq:primal-mom} for different choices of $\beta_t=\beta$ and $\gamma_t=\gamma$. $(b)$ Natural gradient descent (Example \ref{ex:ngd}): for different step-sizes $\eta_t=\eta$, eq. \ref{eq:ngd-update} converges to different global minima. Here, $\eta$ was chosen to be small enough to ensure $w_{(t)}\in\text{dom}(\psi)$. $(c)$ Steepest descent w.r.t. $\|\cdot\|_{4/3}$ (Example \ref{ex:sd}): the global minimum to which eq. \ref{eq:sd-update} converges depends on $\eta$. Here $w_{(0)}=[0,0,0]$, $w^*_{\|\cdot\|}=\operatorname*{arg\,min}_{w\in\mathcal{G}}\|w\|_{4/3}$ denotes the minimum-norm global minimum, and $w^\infty_{\eta\to0}$ denotes the solution of infinitesimal SD with $\eta\to0$. Note that even as $\eta\to 0$, the expected characterization does not hold, i.e., $w^\infty_{\eta\to0}\neq w^*_{\|\cdot\|}$.
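
The baseline that panels $(a)$ and $(b)$ are compared against, plain mirror descent with the entropy potential started at $[1,1]$, can be sketched numerically. This is a minimal illustration under stated assumptions: the single data point `X`, `y` is hypothetical (it is not the example used in the figure), the loss is the squared loss, the potential is taken to be the unnormalized entropy $\psi(w)=\sum_i (w_i\log w_i - w_i)$, and the step-size `eta` is simply chosen small enough for the iteration to be stable. For this toy problem the minimum-potential global minimum can be worked out by hand as $[0.5, 0.25]$.

```python
import numpy as np

# Hypothetical 2-D problem with one equation: w1 + 2 * w2 = 1.
X = np.array([[1.0, 2.0]])
y = np.array([1.0])
w0 = np.array([1.0, 1.0])  # minimizer of the unnormalized entropy potential


def entropy_mirror_descent(X, y, w0, eta=0.05, steps=2000):
    """Mirror descent on L(w) = 0.5 * ||X w - y||^2 with grad psi(w) = log(w).

    The dual update log(w_next) = log(w) - eta * grad L(w) becomes a
    multiplicative update in the primal variable.
    """
    w = w0.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y)
        w = w * np.exp(-eta * grad)
    return w


w_md = entropy_mirror_descent(X, y, w0)

# Hand-derived target: log(w) - log(w0) must be a multiple of the row [1, 2],
# so w = [u, u^2]; the constraint u + 2 u^2 = 1 gives u = 0.5.
w_expected = np.array([0.5, 0.25])

print("mirror descent limit:", w_md)
print("hand-derived minimum:", w_expected)
```

The check relies on the fact that the dual iterates $\log w_{(t)}$ only ever move along the rows of `X`, which together with $Xw=y$ gives the optimality conditions of the minimum-potential problem.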

Theorems & Definitions (44)

  • Theorem 1
  • Theorem 1a
  • Theorem 1b
  • Remark 1
  • Example 2
  • Proposition 2a
  • Example 3
  • Proposition 3a
  • Example 4
  • Theorem 5
  • ...and 34 more