Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks

Yuhang Cai, Kangjie Zhou, Jingfeng Wu, Song Mei, Michael Lindsey, Peter L. Bartlett

TL;DR

This work extends the theory of the implicit bias of gradient descent from homogeneous networks to a broad class of non-homogeneous deep networks by introducing a near-homogeneity framework and a strong separability condition. It proves that, under gradient flow, the normalized margin increases nearly monotonically, the iterates diverge in norm but converge in direction, and the limiting direction satisfies the KKT conditions of a margin-maximization problem for the homogenized network. The results cover architectures with residual connections and non-homogeneous activations, and they extend to gradient descent with large step sizes through analogous margin and homogenization arguments, illustrated by a two-layer network example. The paper thereby resolves an open problem posed by Ji and Telgarsky (2020): implicit bias results hold beyond homogeneous networks under a mild near-homogeneity condition.
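For orientation, the margin-maximization problem and its KKT conditions can be sketched in the form commonly used for homogeneous networks. The notation is an assumption made here and may differ from the paper's Theorem 3.5: $f_{\mathrm{H}}$ denotes the homogenized network (taken here to be the degree-$M$ part of $f$), $y_i \in \{\pm 1\}$ are labels, $\lambda_i$ are multipliers, and $f_{\mathrm{H}}$ is assumed differentiable at $\bm{\theta}$ (otherwise the gradient is replaced by a subgradient).

$$
\begin{aligned}
&\min_{\bm{\theta}} \ \tfrac{1}{2}\|\bm{\theta}\|_2^2
\quad \text{subject to} \quad y_i\, f_{\mathrm{H}}(\bm{\theta};\mathbf{x}_i) \ge 1 \ \text{ for all } i\in[n], \\
&\text{KKT:}\quad \bm{\theta} = \sum_{i=1}^{n} \lambda_i\, y_i\, \nabla_{\bm{\theta}} f_{\mathrm{H}}(\bm{\theta};\mathbf{x}_i), \qquad \lambda_i \ge 0, \qquad \lambda_i\bigl(y_i\, f_{\mathrm{H}}(\bm{\theta};\mathbf{x}_i) - 1\bigr) = 0 \ \text{ for all } i\in[n].
\end{aligned}
$$

In this reading, a direction satisfies the conditions when some positive rescaling of it is a KKT point of the problem above.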

Abstract

We establish the asymptotic implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Specifically, we characterize three key properties of GD iterates starting from a sufficiently small empirical risk, where the threshold is determined by a measure of the network's non-homogeneity. First, we show that a normalized margin induced by the GD iterates increases nearly monotonically. Second, we prove that while the norm of the GD iterates diverges to infinity, the iterates themselves converge in direction. Finally, we establish that this directional limit satisfies the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem. Prior works on implicit bias have focused exclusively on homogeneous networks; in contrast, our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition. In particular, our results apply to networks with residual connections and non-homogeneous activation functions, thereby resolving an open problem posed by Ji and Telgarsky (2020).
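The three properties above can be observed numerically on a toy problem. The following is a minimal sketch, not the paper's experiment: the data, the activation `phi(z) = z - 0.5 * tanh(z)` (which differs from the 1-homogeneous map `z` by a bounded term), the step size, and the naive margin `min_i y_i f(theta; x_i) / ||theta||^M` are all assumptions chosen for illustration. The paper's margin definition and near-homogeneity condition may differ and are not checked here.

```python
import numpy as np

# Hypothetical toy data: four points, linearly separable by the direction (1, 1).
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
M = 1  # assumed homogeneity order of the homogenized model theta . x


def phi(z):
    """Non-homogeneous activation (illustrative choice): z minus a bounded term."""
    return z - 0.5 * np.tanh(z)


def dphi(z):
    """Derivative of phi; it stays in [0.5, 1), so phi is strictly increasing."""
    return 1.0 - 0.5 / np.cosh(z) ** 2


def model(theta):
    """Model outputs f(theta; x_i) = phi(theta . x_i) for all data points."""
    return phi(X @ theta)


def risk(theta):
    """Empirical risk under exponential loss."""
    return float(np.mean(np.exp(-y * model(theta))))


def grad_risk(theta):
    """Gradient of the empirical risk with respect to theta."""
    z = X @ theta
    weights = -y * np.exp(-y * phi(z)) * dphi(z)
    return (weights[:, None] * X).mean(axis=0)


def normalized_margin(theta):
    """Naive normalized margin min_i y_i f(theta; x_i) / ||theta||^M (an assumption)."""
    return float(np.min(y * model(theta)) / np.linalg.norm(theta) ** M)


def run_gd(theta0, lr=0.1, steps=20000, log_every=5000):
    """Plain gradient descent; returns final parameters and a log of
    (step, risk, parameter norm, normalized margin)."""
    theta = np.array(theta0, dtype=float)
    history = [(0, risk(theta), float(np.linalg.norm(theta)), normalized_margin(theta))]
    for t in range(1, steps + 1):
        theta = theta - lr * grad_risk(theta)
        if t % log_every == 0:
            history.append(
                (t, risk(theta), float(np.linalg.norm(theta)), normalized_margin(theta))
            )
    return theta, history


if __name__ == "__main__":
    theta_final, history = run_gd([2.0, 0.5])
    for step, r, norm, margin in history:
        print(f"step={step:6d}  risk={r:.5f}  norm={norm:.3f}  margin={margin:.4f}")
    print("final direction:", theta_final / np.linalg.norm(theta_final))
```

On this toy problem the logged risk shrinks, the parameter norm keeps growing, and the direction `theta / ||theta||` approaches `(1, 1) / sqrt(2)`, the max-margin direction of the homogenized linear model. The subtracted bounded term keeps the naive margin below `1 / sqrt(2)`, so it approaches that value from below.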

Paper Structure

This paper contains 69 sections, 90 theorems, 550 equations.

Key Result

Lemma 3.1

Let $f$ be such that [equation omitted], where $f^{(i)}(\bm{\theta};\mathbf{x})$ is $i$-homogeneous with respect to $\bm{\theta}$. If $f$ satisfies Assumptions \ref{asp:nearhomo} and \ref{asp:initial-cond-gf}, then for every $j\in[n]$, we must have [equation omitted]. Furthermore, we have $f^{(M)}(\bm{\theta}_s; \mathbf{x}_j) > 0$ for all $j\in [n]$.

Theorems & Definitions (177)

  • Definition 1: Near-$M$-homogeneity
  • Lemma 3.1: Near-homogeneity order
  • Theorem 3.2: Risk convergence and margin improvement
  • Example 3.3: Necessity of Assumption \ref{asp:initial-cond-gf}
  • Theorem 3.4: Directional convergence
  • Theorem 3.5: KKT convergence
  • Definition 2: Near-$(M, N)$-homogeneity
  • Example 4.1
  • Lemma 4.2: Composition and multiplication rules
  • Corollary 4.3: Near-homogeneous networks
  • ...and 167 more