Characterizing Implicit Bias in Terms of Optimization Geometry
Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro
TL;DR
The paper investigates how optimization geometry shapes the implicit bias of common algorithms when fitting underdetermined linear models and separable linear classifiers. It distinguishes losses with a unique finite root (such as the squared loss) from strictly monotone losses (such as the exponential loss) and derives, where possible, characterizations that depend only on the geometry and not on hyperparameters such as step size. Mirror descent (MD) converges to the global minimum closest to the initialization in Bregman divergence. Natural gradient descent (NGD) reduces to MD in the infinitesimal-step limit. For strictly monotone losses, steepest descent converges in direction to the maximum-margin solution with respect to the norm that defines the descent, and gradient descent on a matrix factorization is biased toward the maximum-margin solution in nuclear norm. The paper also shows that AdaGrad can retain an initialization-dependent bias even for monotone losses, and it gives detailed proofs and extensions in the appendices. Overall, the work clarifies when and how optimization geometry determines which global minimum popular algorithms reach, offering a principled lens on inductive bias in linear and simple structured models and setting the stage for extending these ideas to more complex models and nonconvex settings.
Abstract
We study the implicit bias of generic optimization methods, such as mirror descent, natural gradient descent, and steepest descent with respect to different potentials and norms, when optimizing underdetermined linear regression or separable linear classification problems. We explore the question of whether the specific global minimum (among the many possible global minima) reached by an algorithm can be characterized in terms of the potential or norm of the optimization geometry, and independently of hyperparameter choices such as step-size and momentum.
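To make the question concrete, the sketch below covers the simplest case: plain gradient descent on an underdetermined least-squares problem, which is mirror descent with the squared Euclidean potential. It is an illustration under stated assumptions, not code from the paper; the problem sizes, the random data, and the helper names `gradient_descent` and `closest_interpolant` are all hypothetical. The well-known fact being checked is that gradient descent converges to the interpolating solution closest to the initialization in Euclidean distance, whichever stable step size is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                          # fewer equations than unknowns
X = rng.standard_normal((n, d))       # illustrative random data
y = rng.standard_normal(n)

def gradient_descent(w0, step, iters=20000):
    """Plain gradient descent on the loss 0.5 * ||X w - y||^2."""
    w = w0.copy()
    for _ in range(iters):
        w -= step * X.T @ (X @ w - y)
    return w

def closest_interpolant(w0):
    """Euclidean projection of w0 onto the solution set {w : X w = y}."""
    return w0 + np.linalg.pinv(X) @ (y - X @ w0)

w0 = rng.standard_normal(d)           # arbitrary initialization
L = np.linalg.norm(X, 2) ** 2         # largest eigenvalue of X^T X

# Two different step sizes, both below the stability limit 2 / L.
w_small = gradient_descent(w0, step=0.5 / L)
w_large = gradient_descent(w0, step=1.0 / L)
w_proj = closest_interpolant(w0)

print("gap, small step:", np.max(np.abs(w_small - w_proj)))
print("gap, large step:", np.max(np.abs(w_large - w_proj)))
```

Both runs should agree with the projection up to numerical error. The limit depends on the initialization `w0` but not on the step size, which is the kind of characterization the abstract asks for.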
