Implicit Bias of Gradient Descent on Linear Convolutional Networks

Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro

TL;DR

This paper analyzes the implicit bias induced by gradient descent when training over-parameterized linear networks. It shows a sharp contrast: fully connected networks (any depth) converge to the hard-margin SVM direction, while linear convolutional networks bias toward frequency-domain sparsity, with the depth $L$ determining the bridge penalty $\|\widehat{\boldsymbol{\beta}}\|_{2/L}$. The authors provide a unified framework linking parameter-space homogeneity to predictor-space regularizers, deriving explicit forms for the induced penalties in both the time and Fourier domains. The work highlights a fundamental inductive bias arising solely from convolutional parameterization, suggesting broader implications for generalization and the design of optimization strategies in deep linear models.

Abstract

We show that gradient descent on full-width linear convolutional networks of depth $L$ converges to a linear predictor related to the $\ell_{2/L}$ bridge penalty in the frequency domain. This is in contrast to linear fully connected networks, where gradient descent converges to the hard margin linear support vector machine solution, regardless of depth.
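To make the frequency-domain bridge penalty concrete, the following is a minimal sketch that evaluates $\|\widehat{\boldsymbol{\beta}}\|_{2/L}$ for a given predictor. The function names `unitary_dft` and `bridge_norm` are hypothetical, and the $1/\sqrt{D}$ (unitary) normalization of the discrete Fourier transform is an assumption made here for illustration. The sketch only evaluates the penalty and does not simulate gradient descent.

```python
import cmath
import math


def unitary_dft(beta):
    """Discrete Fourier transform with 1/sqrt(D) scaling (normalization assumed here)."""
    d = len(beta)
    scale = 1.0 / math.sqrt(d)
    return [
        scale * sum(b * cmath.exp(-2j * cmath.pi * k * n / d) for n, b in enumerate(beta))
        for k in range(d)
    ]


def bridge_norm(beta, depth):
    """l_{2/depth} quasi-norm of the Fourier coefficients of beta."""
    p = 2.0 / depth
    total = sum(abs(c) ** p for c in unitary_dft(beta))
    return total ** (1.0 / p)


# Two predictors with the same l2 norm: an impulse (dense spectrum)
# and a constant vector (a single nonzero Fourier coefficient).
impulse = [1.0, 0.0, 0.0, 0.0]
constant = [0.5, 0.5, 0.5, 0.5]
for depth in (1, 2, 3):
    print(depth, bridge_norm(impulse, depth), bridge_norm(constant, depth))
```

At depth 1 the penalty reduces to the $\ell_2$ norm and cannot tell the two predictors apart. At depth 2 it becomes the $\ell_1$ norm of the spectrum, which is about 2 for the impulse and about 1 for the constant vector, so the frequency-sparse predictor is cheaper. The gap widens as the depth grows.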

Paper Structure

This paper contains 25 sections, 21 theorems, 109 equations, 1 figure.

Key Result

Theorem 1

For any depth $L$, almost all linearly separable datasets $\{\mathbf{x}_n,y_n\}_{n=1}^N$, almost all initializations $\mathbf{w}^{(0)}$, and any bounded sequence of step sizes $\{\eta_t\}_t$, consider the sequence of gradient descent iterates $\mathbf{w}^{(t)}$ in eq. (eq:gd) for minimizing $\mathcal{L}_{\mathcal{P}}$ … then the limit direction is given by …
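The abstract identifies this limit, for linear fully connected networks, as the hard margin linear support vector machine solution. For reference, the standard hard-margin problem and its normalized direction can be written as below. The symbols $\boldsymbol{\beta}$ for the linear predictor and $\boldsymbol{\beta}^{\star}_{\ell_2}$ for the minimizer are notation assumed here and may differ from the paper's own.

```latex
\boldsymbol{\beta}^{\star}_{\ell_2}
  = \operatorname*{argmin}_{\boldsymbol{\beta}} \; \|\boldsymbol{\beta}\|_2^2
  \quad \text{s.t.} \quad
  y_n \langle \mathbf{x}_n, \boldsymbol{\beta} \rangle \ge 1 \;\; \text{for all } n,
\qquad
\text{limit direction} \;=\;
  \frac{\boldsymbol{\beta}^{\star}_{\ell_2}}{\|\boldsymbol{\beta}^{\star}_{\ell_2}\|_2}.
```

The constraint set does not depend on $L$, which matches the abstract's statement that the fully connected result holds regardless of depth.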

Figures (1)

  • Figure 1: Implicit bias of gradient descent for different linear network architectures.

Theorems & Definitions (33)

  • Theorem 1: Linear fully connected networks
  • Theorem 2: Linear convolutional networks of depth two
  • Theorem 2a: Linear convolutional networks of any depth
  • Lemma 2
  • Definition: Homogeneous polynomial
  • Theorem 3: Homogeneous polynomial parameterization
  • Lemma 3
  • Lemma 3
  • Lemma 3
  • Lemma 4
  • ...and 23 more