A Unified Approach to Controlling Implicit Regularization via Mirror Descent

Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, Navid Azizan

TL;DR

The paper investigates how optimization algorithms shape implicit regularization in over-parameterized models and proposes mirror descent (MD) with homogeneous potentials as a unifying mechanism to control this bias. It proves that, for separable linear classification and losses with exponential tails, MD converges in direction to a generalized max-margin direction with respect to the chosen potential, and it derives poly-log and accelerated rates under fixed and normalized step sizes, respectively. Extending beyond Euclidean geometry, the study shows that different potentials induce different implicit biases, and that normalized MD can significantly speed up convergence while preserving the bias. The authors validate the theory through linear and deep-network experiments, including MNIST, CIFAR-10, and ImageNet, demonstrating that MD with various potentials yields distinct regularizers and generalization behaviors. Overall, this work provides a broad, practically applicable framework for steering implicit regularization via geometry-aware mirror-descent updates, with implications for designing optimization methods that tailor generalization properties of learned models.

Abstract

Inspired by the remarkable success of large neural networks, there has been significant interest in understanding the generalization performance of over-parameterized models. Substantial efforts have been invested in characterizing how optimization algorithms impact generalization through their "preferred" solutions, a phenomenon commonly referred to as implicit regularization. In particular, it has been argued that gradient descent (GD) induces an implicit $\ell_2$-norm regularization in regression and classification problems. However, the implicit regularization of each of these algorithms is confined to either a specific geometry or a particular class of learning problems, indicating the lack of a general approach for controlling the implicit regularization. To address this, we present a unified approach using mirror descent (MD), a notable generalization of GD, to control implicit regularization in both regression and classification settings. More specifically, we show that MD with the general class of homogeneous potential functions converges in direction to a generalized maximum-margin solution for linear classification problems, thereby answering a long-standing question in the classification setting. Further, we show that MD can be implemented efficiently and enjoys fast convergence under suitable conditions. Through comprehensive experiments, we demonstrate that MD is a versatile method to produce learned models with different regularizers, which in turn have different generalization performance.
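As a rough illustration of the kind of update the abstract refers to, the following is a minimal sketch of mirror descent with the potential $\psi(w) = \frac{1}{p}\lVert w \rVert_p^p$ (called $p$-GD in the figure captions) on the exponential loss. It uses the standard mirror descent step $\nabla\psi(w_{t+1}) = \nabla\psi(w_t) - \eta \nabla L(w_t)$. The function name `p_gd`, the toy dataset, the step size, and the iteration count are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def p_gd(X, y, p=3.0, lr=0.1, steps=500):
    """Sketch of mirror descent with potential psi(w) = (1/p) * ||w||_p^p.

    Minimizes the average exponential loss (1/n) * sum_i exp(-y_i <w, x_i>).
    Hyperparameters are illustrative assumptions, not values from the paper.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # Gradient of the average exponential loss with respect to w.
        grad = -(X * (y * np.exp(-margins))[:, None]).mean(axis=0)
        # Mirror map: z = grad psi(w) = sign(w) * |w|^(p-1), applied coordinate-wise.
        z = np.sign(w) * np.abs(w) ** (p - 1.0)
        # Gradient step in the mirror (dual) space.
        z = z - lr * grad
        # Inverse mirror map: w = sign(z) * |z|^(1/(p-1)).
        w = np.sign(z) * np.abs(z) ** (1.0 / (p - 1.0))
    return w

# Hypothetical linearly separable toy data (two points per class).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w3 = p_gd(X, y, p=3.0)
print("p = 3 direction:", w3 / np.linalg.norm(w3, ord=3))
```

With `p=2.0` both the mirror map and its inverse are the identity, so the sketch reduces to plain gradient descent; other values of `p` change the geometry of the step and, according to the paper's results, the direction the iterates converge to.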

Paper Structure

This paper contains 62 sections, 19 theorems, 130 equations, 13 figures, 13 tables.

Key Result

Lemma 2

For any $w \in \mathbb{R}^n$, the following identities hold for \ref{['equ:md']}: … For convenience, for a function $f$, we write $D_f(x, y) := f(x) - f(y) - \left\langle \nabla f(y), x - y \right\rangle$. Note that when $f$ is convex, $D_f(\cdot, \cdot) \ge 0$, and when $f$ is strictly convex, $D_f(\cdot, \cdot)$ …
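To make the definition of $D_f$ concrete, here is a small sketch that evaluates it numerically for the potential $\psi(w) = \frac{1}{p}\lVert w \rVert_p^p$ used in the figures. The helper names `bregman_divergence` and `make_p_potential` and the test vectors are hypothetical, chosen only for this example.

```python
import numpy as np

def bregman_divergence(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

def make_p_potential(p):
    """Return psi(w) = (1/p) * ||w||_p^p and its gradient (illustrative helper)."""
    psi = lambda w: np.sum(np.abs(w) ** p) / p
    grad_psi = lambda w: np.sign(w) * np.abs(w) ** (p - 1.0)
    return psi, grad_psi

# Arbitrary test vectors (assumptions for the example).
x = np.array([1.0, -2.0, 0.5])
y = np.array([0.5, 1.0, -1.0])

psi2, grad2 = make_p_potential(2.0)
d2 = bregman_divergence(psi2, grad2, x, y)  # p = 2 gives 0.5 * ||x - y||_2^2

psi3, grad3 = make_p_potential(3.0)
d3 = bregman_divergence(psi3, grad3, x, y)  # nonnegative because psi is convex

print(d2, d3)
```

For $p = 2$ the divergence is half the squared Euclidean distance, which is one way to see that MD with this potential recovers gradient descent.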

Figures (13)

  • Figure 1: The generalized maximum-margin solution to a single data point (denoted by $\bullet$) with respect to the $\ell_{1}, \ell_2$, and $\ell_{10}$ norms. For each generalized max-margin solution $u$, we plot the decision boundary $\{x \mid u^\top x = 0\}$.
  • Figure 2: An example of $p$-GD (MD with potential $\psi(\cdot) = \frac{1}{p} \left\lVert \cdot \right\rVert_p^p$) on randomly generated data with exponential loss and $p = 1.5, 2, 3$. (1) The left plot is a scatter plot of the data: $\times$'s and $\bullet$'s denote the two different labels ($y_i = \pm 1$). The dotted line is the $\ell_2$ max-margin classifier. For clarity, other $\ell_p$ max-margin classifiers are omitted from the plot. (2) The middle plot shows the rate at which the quantity $D_{\psi}\left(u^{\sf \tiny r}_{p},w_t / \left\lVert w_t \right\rVert_p\right)$ converges to 0. (3) The right plot shows how fast the $p$-norm of $w_t$ grows. The asymptotic behavior in these plots is consistent with Corollary \ref{['thm:final-convg-rate']}.
  • Figure 3: An example of MD with potential $\psi(\cdot) = \frac{1}{p} \left\lVert \cdot \right\rVert_p^\beta$ on the same dataset as in Figure \ref{['fig:synthetic-data']}. To verify the conclusion of Corollary \ref{['thm:final-convg-rate']}, we plot the quantity $D_{\psi}\left(u^{\sf \tiny r}_{p},w_t / \left\lVert w_t \right\rVert_p\right)$ for $p = 2$ (left figure) and $p = 3$ (right figure). The rate of convergence is faster for higher values of the exponent $\beta$, which is consistent with Corollary \ref{['thm:final-convg-rate']}.
  • Figure 4: An example of $p$-GD (MD with potential $\psi(\cdot) = \frac{1}{p} \left\lVert \cdot \right\rVert_p^p$) and normalized $p$-GD on randomly generated data with exponential loss and $p = 1.5$. (1) The left plot is the empirical loss. (2) The middle plot shows the rate at which the quantity $D_{\psi}\left(u^{\sf \tiny r}_{p},w_t / \left\lVert w_t \right\rVert_p\right)$ converges to 0. (3) The right plot shows how fast the $p$-norm of $w_t$ grows.
  • Figure 5: Training loss of $p$-GD and normalized $p$-GD on the MNIST dataset with $p = 1.5$. (1) The left plot uses a fully connected network. (2) The right plot uses a conv-net.
  • ...and 8 more figures
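
The single-data-point setting of Figure 1 can be worked out in closed form. The sketch below assumes the generalized max-margin solution for one point $x$ with label $+1$ is the minimizer of $\lVert u \rVert_p$ subject to $u^\top x \ge 1$; this formulation, the function name `single_point_max_margin`, and the example point are assumptions for illustration and may differ from the paper's exact definition. Under that assumption, Hölder's inequality gives $u_i = \operatorname{sign}(x_i)\lvert x_i \rvert^{q-1} / \lVert x \rVert_q^q$ with $1/p + 1/q = 1$, valid for $1 < p < \infty$ (so the $\ell_1$ case from the figure is not covered by this formula).

```python
import numpy as np

def single_point_max_margin(x, p):
    """Minimizer of ||u||_p subject to <u, x> >= 1, for 1 < p < infinity.

    Closed form from the equality case of Hölder's inequality; the problem
    formulation itself is an assumption made for this sketch.
    """
    q = p / (p - 1.0)  # Hölder conjugate exponent, 1/p + 1/q = 1
    u = np.sign(x) * np.abs(x) ** (q - 1.0)
    return u / np.sum(np.abs(x) ** q)

# Hypothetical data point; the decision boundary is {z : <u, z> = 0}.
x = np.array([1.0, 2.0])
solutions = {p: single_point_max_margin(x, p) for p in (1.5, 2.0, 10.0)}
for p, u in solutions.items():
    print(f"p = {p}: direction = {u / np.linalg.norm(u)}")
```

Each choice of `p` yields a different direction `u`, and therefore a different decision boundary, which is the effect the figure illustrates.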

Theorems & Definitions (27)

  • Example 1
  • Definition 1: Bregman divergence bregman1967relaxation
  • Lemma 2: \ref{['equ:md']} identity
  • Lemma 3
  • Lemma 4
  • Remark 5
  • Definition 6
  • Definition 7
  • Theorem 8: soudry2018implicit
  • Theorem 9: ji2020gradient
  • ...and 17 more