Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced

Simon S. Du, Wei Hu, Jason D. Lee

TL;DR

This work analyzes how first-order optimization methods implicitly regularize deep homogeneous models. It shows that gradient flow keeps the differences between the squared norms of adjacent layers invariant, so the layers are automatically balanced when the initialization is small. The authors use this invariance to prove that gradient descent on an unregularized, non-convex problem (asymmetric matrix factorization) converges to a global optimum under suitably diminishing step sizes, with a sharper linear-convergence result in the rank-1 case. Experiments on deep networks with ReLU activations support the theory: inter-layer norm differences remain small and layer-norm ratios approach one. The findings suggest that examining the invariances of first-order algorithms is a useful way to analyze training dynamics beyond traditional smoothness assumptions.

Abstract

We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions including feed-forward fully connected and convolutional deep neural networks with linear, ReLU or Leaky ReLU activation. We rigorously prove that gradient flow (i.e. gradient descent with infinitesimal step size) effectively enforces the differences between squared norms across different layers to remain invariant without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes $\eta_t = O\left(t^{-\left(\frac12+\delta\right)}\right)$ ($0<\delta\le\frac12$) automatically balances two low-rank factors and converges to a bounded global optimum. Furthermore, for rank-$1$ asymmetric matrix factorization we give a finer analysis showing gradient descent with constant step size converges to the global minimum at a globally linear rate. We believe that the idea of examining the invariance imposed by first order algorithms in learning homogeneous models could serve as a fundamental building block for studying optimization for learning deep models.
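The balancing effect described in the abstract can be observed numerically in the rank-1 case. The following is a minimal sketch, not the authors' code: it runs plain gradient descent with a constant step size on $f(u, v) = \frac12\|uv^\top - M\|_F^2$ and records the gap $\|u\|^2 - \|v\|^2$ at every iteration. The function names (`run_gd`, `make_rank1`, `gradients`, `sq_norm`), the matrix sizes, the seeds, the step size, and the initialization scale are all illustrative assumptions.

```python
import random


def sq_norm(x):
    """Squared Euclidean norm of a list of floats."""
    return sum(a * a for a in x)


def gradients(u, v, M):
    """Gradients of f(u, v) = 0.5 * ||u v^T - M||_F^2 with respect to u and v."""
    Mv = [sum(m * vj for m, vj in zip(row, v)) for row in M]  # M v
    Mtu = [sum(M[i][j] * u[i] for i in range(len(u)))
           for j in range(len(v))]                            # M^T u
    nu, nv = sq_norm(u), sq_norm(v)
    gu = [ui * nv - a for ui, a in zip(u, Mv)]    # (u v^T - M) v
    gv = [vj * nu - b for vj, b in zip(v, Mtu)]   # (u v^T - M)^T u
    return gu, gv


def loss(u, v, M):
    """Objective value 0.5 * ||u v^T - M||_F^2."""
    return 0.5 * sum((u[i] * v[j] - M[i][j]) ** 2
                     for i in range(len(u)) for j in range(len(v)))


def make_rank1(d1=8, d2=6, seed=1):
    """Hypothetical test target: a random rank-1 matrix a b^T."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(d1)]
    b = [rng.gauss(0, 1) for _ in range(d2)]
    return [[ai * bj for bj in b] for ai in a]


def run_gd(M, step=1e-3, iters=5000, init_scale=1e-2, seed=0):
    """Constant-step gradient descent from a small random initialization.

    Returns the final factors and the history of ||u||^2 - ||v||^2.
    """
    rng = random.Random(seed)
    u = [init_scale * rng.gauss(0, 1) for _ in range(len(M))]
    v = [init_scale * rng.gauss(0, 1) for _ in range(len(M[0]))]
    gaps = [sq_norm(u) - sq_norm(v)]
    for _ in range(iters):
        gu, gv = gradients(u, v, M)
        u = [ui - step * g for ui, g in zip(u, gu)]
        v = [vj - step * g for vj, g in zip(v, gv)]
        gaps.append(sq_norm(u) - sq_norm(v))
    return u, v, gaps


if __name__ == "__main__":
    M = make_rank1()
    u, v, gaps = run_gd(M)
    print("initial gap:", gaps[0])
    print("largest drift of the gap:", max(abs(g - gaps[0]) for g in gaps))
    print("final loss:", loss(u, v, M))
    print("norms:", sq_norm(u) ** 0.5, sq_norm(v) ** 0.5)
```

The first-order terms of a gradient step cancel in the gap, so a single step changes $\|u\|^2 - \|v\|^2$ by exactly $\eta^2\left(\|\nabla_u f\|^2 - \|\nabla_v f\|^2\right)$. This is why the gap is conserved in the gradient-flow limit and why a small positive step size keeps it close to its initial value.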

Paper Structure

This paper contains 24 sections, 12 theorems, 61 equations, 2 figures.

Key Result

Theorem 2.1

For any $h\in[N-1]$ and $i\in[n_h]$, we have
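As a sketch of the invariance this theorem describes, assume $W^{(h)}$ denotes the weight matrix of layer $h$, so that row $i$ of $W^{(h)}$ holds the incoming weights of neuron $i$ in layer $h$ and column $i$ of $W^{(h+1)}$ holds its outgoing weights. This notation is an assumption, since it is not defined in this listing. Under gradient flow, the conserved quantity would then take the form

$$\frac{d}{dt}\left( \left\| W^{(h)}[i,:] \right\|^2 - \left\| W^{(h+1)}[:,i] \right\|^2 \right) = 0.$$

Summing over $i \in [n_h]$ would give the layer-level statement that $\|W^{(h)}\|_F^2 - \|W^{(h+1)}\|_F^2$ is constant in time, which matches the invariance between layers described in the abstract.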

Figures (2)

  • Figure 1: Experiments on the matrix factorization problem with objective functions \ref{eqn:intro_mf_obj} and \ref{eqn:intro_mf_reg_obj}. Red lines correspond to running GD on objective \ref{eqn:intro_mf_obj}, and blue lines correspond to running GD on objective \ref{eqn:intro_mf_reg_obj}.
  • Figure 2: Balancedness of a 3-layer neural network.

Theorems & Definitions (30)

  • Theorem 2.1: Balanced incoming and outgoing weights at every neuron
  • Corollary 2.1: Balanced weights across layers
  • Theorem 2.2: Stronger balancedness property for linear activation
  • Theorem 2.3
  • Proof of Theorem \ref{thm:conserved-neuron}
  • Theorem 3.1
  • Lemma 3.1
  • Lemma 3.2
  • Theorem 3.2: Approximate balancedness and linear convergence of GD for rank-$1$ matrix factorization
  • Proof of Theorem \ref{thm:conserved-linear}
  • ...and 20 more