Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks

Chenyang Zhang, Peifeng Gao, Difan Zou, Yuan Cao

TL;DR

The paper investigates how CNNs trained by gradient descent capture the intrinsic dimension of the data, measured by the stable rank of the learned filters, when images have noisy backgrounds. It introduces a low-rank patch-based data model and analyzes a two-layer CNN with Hubered ReLU, proving that the stable rank of the CNN filters stays close to the clean-data rank $2K$ across a broad noise regime, while the stable rank of the data explodes with noise. It provides convergence guarantees for training and test losses and supports the theory with experiments on MNIST, CIFAR-10, and synthetic data, in which the filter rank stays robust while the data rank does not. The findings illuminate a form of implicit bias in gradient descent: CNNs preferentially learn a low-rank, clean-data subspace even under substantial background noise, with implications for understanding generalization and robustness.

Abstract

Modern neural networks are usually highly over-parameterized. Behind the wide usage of over-parameterized networks is the belief that, if the data are simple, then the trained network will be automatically equivalent to a simple predictor. Following this intuition, many existing works have studied different notions of "ranks" of neural networks and their relation to the rank of data. In this work, we study the rank of convolutional neural networks (CNNs) trained by gradient descent, with a specific focus on the robustness of the rank to image background noises. Specifically, we point out that, when adding background noises to images, the rank of the CNN trained with gradient descent is affected far less compared with the rank of the data. We support our claim with a theoretical case study, where we consider a particular data model to characterize low-rank clean images with added background noises. We prove that CNNs trained by gradient descent can learn the intrinsic dimension of clean images, despite the presence of relatively large background noises. We also conduct experiments on synthetic and real datasets to further validate our claim.

Paper Structure

This paper contains 23 sections, 23 theorems, 133 equations, 4 figures.

Key Result

Proposition 2.3

Suppose that $d \geq \widetilde{\Omega}(n^4)$, and $K, P\leq O(1)$. For any positive $\delta$ satisfying $\log(1/\delta)\leq O(d)$, the following inequalities concerning the stable rank of $\widehat{\mathbf{X}}$ hold with probability at least $1-\delta$.

Figures (4)

  • Figure 1: Ranks of data and filters under different noise levels. In (a), we apply principal component analysis (PCA) to a subset of MNIST images and keep 20 principal components, so that the clean data have rank 20. We then add background noise patches around the resulting low-rank images and train a two-layer CNN with fixed second-layer weights until convergence. Finally, we calculate both the rank of the noisy images and the rank of the matrix consisting of all the convolutional filters of the CNN. When calculating ranks, eigenvalues smaller than $1/100$ of the largest eigenvalue are ignored. The curves of filter rank and data rank with respect to the noise level are plotted. In (b), we conduct a similar set of experiments on the CIFAR-10 dataset.
  • Figure 2: Illustration of the stable ranks of the data matrix and CNN filters under different noise levels $\sigma_{\mathrm{noise}}$ and sample sizes $n$. In Region I, stable ranks of both the data matrix and the CNN filters stay close to the rank of the clean data. In Region II, the stable rank of CNN remains close to the rank of the clean data, while the stable rank of the data matrix explodes. In Region III, the stable ranks of both the data matrix and CNN filters explode. It is evident that the stable rank of CNN remains close to the rank of the clean data under a much wider regime, demonstrating its robustness to background noises.
  • Figure 3: Illustration of a training image from the MNIST dataset, reduced to rank $10$ and padded with a circle of noise.
  • Figure 4: Rank of the data and learned filters under different noise levels. The $x$-axis shows the noise level $\sigma_{\mathrm{noise}}$, and the $y$-axis shows the rank. The data rank increases rapidly as the noise becomes stronger, while the rank of the CNN filters remains robust against the noise and stays equal to the rank of the clean data.
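
The two rank measures that appear in the captions above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: `thresholded_rank` applies the "ignore eigenvalues below $1/100$ of the largest" rule from Figure 1 to the eigenvalues of $M^\top M$ (the caption does not say which matrix the eigenvalues belong to), and `stable_rank` uses the standard definition $\|M\|_F^2 / \|M\|_2^2$, which may differ in detail from the paper's Definition 2.1. The synthetic matrices, their sizes, and the noise level are hypothetical and do not follow the paper's patch-based data model.

```python
import numpy as np


def thresholded_rank(M, rel_tol=1e-2):
    """Count eigenvalues of M^T M that are at least rel_tol times the largest.

    Assumption: the eigenvalues in the Figure 1 rule are those of M^T M,
    i.e. the squared singular values of M.
    """
    s = np.linalg.svd(M, compute_uv=False)  # singular values, descending
    eig = s ** 2
    return int(np.sum(eig >= rel_tol * eig[0]))


def stable_rank(M):
    """Standard stable rank: squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)


# Hypothetical sizes: n samples, d features, clean rank r.
rng = np.random.default_rng(0)
n, d, r = 200, 100, 5

# Clean matrix with exactly r singular values equal to 1.
Q1, _ = np.linalg.qr(rng.standard_normal((n, r)))  # n x r, orthonormal columns
Q2, _ = np.linalg.qr(rng.standard_normal((d, r)))  # d x r, orthonormal columns
clean = Q1 @ Q2.T

# Dense Gaussian noise added to every entry (a simplification of
# "background noise"; sigma_noise is a hypothetical value).
sigma_noise = 1.0
noisy = clean + sigma_noise * rng.standard_normal((n, d))

print("clean:", thresholded_rank(clean), stable_rank(clean))
print("noisy:", thresholded_rank(noisy), stable_rank(noisy))
```

On this toy example both measures recover the clean rank for the clean matrix and grow once noise is added, which is the behavior the figures report for the data matrix. Reproducing the filter curves would additionally require training the CNN and passing its filter matrix to the same two functions.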

Theorems & Definitions (47)

  • Definition 2.1
  • Definition 2.2: Low-rank clean images with noisy backgrounds
  • Proposition 2.3
  • Theorem 3.2
  • Theorem 4.1
  • Corollary 4.2
  • Lemma 6.1
  • Lemma 6.2
  • Lemma 6.3
  • Proof of Theorem \ref{thm:main_result3}
  • ...and 37 more