Geometric Inductive Biases of Deep Networks: The Role of Data and Architecture

Sajad Movahedi, Antonio Orvieto, Seyed-Mohsen Moosavi-Dezfooli

TL;DR

The paper tackles why neural architectures induce different inductive biases by proposing the Geometric Invariance Hypothesis (GIH), which states that a network's input-space geometry can only change in a subspace determined by the architecture. It introduces the average geometry $\mathbf{G}_{\mathcal{F}}^{t}$ and average geometry evolution $\Delta_{\mathcal{F}}^{t}$ to quantify how input-space curvature evolves during training and shows that, at initialization, this evolution is governed by the data covariance $\mathbf{S}$ projected onto the initial geometry, i.e., $\Delta_{\mathcal{F}}^{0} \propto \mathbf{G}_{\mathcal{F}}^{0}\mathbf{S}\mathbf{G}_{\mathcal{F}}^{0}$. The authors provide theoretical results and empirical evidence across isotropic models (MLP) and non-isotropic architectures (CNNs, ResNet-like) that the data-geometry interaction is architecture-dependent, with CNNs effectively projecting $\mathbf{S}$ through $\mathbf{G}_{\mathcal{F}}$, leading to invariant directions and impacting generalization. They connect these geometric insights to the generalization gap and the simplicity bias, show how discriminant features align with initial geometry directions, and present practical analyses showing how geometry informs sample importance and feature removal strategies. The work suggests a unified geometric framework to understand how architecture and data jointly shape inductive biases and generalization, with implications for architecture design and data conditioning in real-world tasks.

Abstract

In this paper, we propose the geometric invariance hypothesis (GIH), which argues that the input space curvature of a neural network remains invariant under transformation in certain architecture-dependent directions during training. We investigate a simple, non-linear binary classification problem residing on a plane in a high dimensional space and observe that, unlike MLPs, ResNets fail to generalize depending on the orientation of the plane. Motivated by this example, we define a neural network's average geometry and average geometry evolution as compact architecture-dependent summaries of the model's input-output geometry and its evolution during training. By investigating the average geometry evolution at initialization, we discover that the geometry of a neural network evolves according to the data covariance projected onto its average geometry. This means that the geometry only changes in a subset of the input space when the average geometry is low-rank, such as in ResNets. This causes an architecture-dependent invariance property in the input space curvature, which we dub GIH. Finally, we present extensive experimental results to observe the consequences of GIH and how it relates to generalization in neural networks.
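The low-rank argument can be illustrated with a small numerical sketch. It does not compute the paper's average geometry from a real network; instead it assumes a hypothetical low-rank positive semi-definite matrix `G` and a hypothetical full-rank covariance `S` (both random, with illustrative sizes `D` and `r`), and checks that the product $\mathbf{G}\mathbf{S}\mathbf{G}$ has rank at most that of `G` and vanishes on the null space of `G`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, r = 8, 3  # illustrative input dimension and assumed rank of G

# Hypothetical low-rank PSD stand-in for the average geometry: G = U U^T.
U = rng.standard_normal((D, r))
G = U @ U.T

# Hypothetical full-rank data covariance S (well conditioned by construction).
A = rng.standard_normal((D, 4 * D))
S = A @ A.T / (4 * D)

# Geometry evolution at initialization, up to a constant factor: G S G.
delta0 = G @ S @ G

# Null-space directions of G: eigenvectors with (numerically) zero eigenvalue.
# np.linalg.eigh returns eigenvalues in ascending order.
_, eigvecs = np.linalg.eigh(G)
null_dirs = eigvecs[:, : D - r]

# Rank of G S G, with an explicit tolerance relative to its spectral norm.
rank_delta0 = np.linalg.matrix_rank(delta0, tol=1e-8 * np.linalg.norm(delta0, 2))

# Largest entry of (G S G) applied to the null-space directions of G.
leak = np.abs(delta0 @ null_dirs).max()

print("rank of G S G:", rank_delta0)
print("max |G S G v| over null-space directions v:", leak)
```

Under these assumptions the rank of `delta0` equals `r`, and `leak` is at the level of floating-point rounding, so no full-rank `S` can induce change in directions that `G` annihilates.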

Paper Structure

This paper contains 36 sections, 9 theorems, 77 equations, 15 figures, 1 table.

Key Result

Theorem 3.1

Let $\mathcal{F}$ be the family of MLPs with a single hidden layer of size $n$ and ReLU non-linearity. Assuming the SSE loss, as the input dimension $D$ and the model width $n$ grow, the average geometry evolution at initialization $\Delta_{\mathcal{F}}$ approaches the data covariance ...

Figures (15)

  • Figure 1: The correlation between $\mathbf{G}_{\mathcal{F}}^t =\mathbf{G}_{\mathcal{F}, \mathcal{N}\left(0, \mathbf{I}\right)}^t$ and $\mathbf{S}$ and $\mathbf{G}_{\mathcal{F}}\mathbf{S}\mathbf{G}_{\mathcal{F}}$ for the (a) MLP, (b) LeNet, (c) ResNet18 without batch normalization, and (d) ViT without layer normalization on CIFAR-2. Note that $D=32\times 32\times 3=3072$, which means the expected cosine similarity of two randomly generated vectors in the input space is $\mathcal{O}\left(1/\sqrt{3072}\right)\approx0.02$. Therefore, we consider the correlations significant.
  • Figure 2: The train accuracy (green lines) and velocity $\dot{\mathbf{G}}^t(\cdot,\cdot)$ (blue lines) of the (a) MLP, (b) LeNet, (c) ResNet18 without batch normalization, and (d) ViT without layer normalization on two synthetic datasets with random labels: $\mathbf{G}_{\mathcal{F}}$ covariance, $\mathbf{x}\sim \mathcal{N}\left(0, \mathbf{G}_{\mathcal{F}}/\Vert \mathbf{G}_{\mathcal{F}}\Vert_2\right)$, and $\text{flip}\left(\mathbf{G}_{\mathcal{F}}\right)$ covariance, $\mathbf{x}\sim \mathcal{N}\left(0, \text{flip}\left(\mathbf{G}_{\mathcal{F}}\right) /\Vert \text{flip}\left(\mathbf{G}_{\mathcal{F}}\right)\Vert_2\right)$. A horizontal line for velocity indicates no change in the geometry.
  • Figure 3: Test accuracy of (a) LeNet, (b) ResNet18 without batch normalization, and (c) ViT without layer normalization on the CIFAR-2 data for various types of decision boundary. The $i^{th}$ x-axis point corresponds to a dataset wherein the discriminant feature corresponds to the $i^{th}$ eigenvalue of $\mathbf{G}_{\mathcal{F}}\mathbf{S}\mathbf{G}_{\mathcal{F}}$ in descending order.
  • Figure 4: The test accuracy of the linear and non-linear components of (a) MLP, (b) LeNet, (c) ResNet18, and (d) ViT on the synthetic data with both linear and non-linear components. The x-axis corresponds to the eigenvalue index in descending order.
  • Figure 5: Test accuracy of (a) LeNet, (b) ResNet18 without batch normalization, and (c) ViT without layer normalization for the feature distribution experiment. We report results for CIFAR-2, with features eliminated up to the index number of the generalized eigenvectors of $\mathbf{G}_{\mathcal{F}}$ and $\mathbf{S}$. For a fair comparison with random orthogonal directions, we delete features along those directions until similar variability is removed from the data support.
  • ...and 10 more figures
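
The $\mathcal{O}\left(1/\sqrt{3072}\right)\approx0.02$ baseline quoted in the Figure 1 caption can be checked with a quick simulation. The sketch below assumes the "randomly generated vectors" are independent standard Gaussian vectors, which is an assumption and not something the caption specifies. For such vectors the mean squared cosine similarity is $1/D$, so the root-mean-square value should be close to $1/\sqrt{D}$. The sample count `n_pairs` is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32 * 32 * 3   # CIFAR input dimension, 3072
n_pairs = 1000    # number of random vector pairs (illustrative choice)

# Assumption: "random vectors" means i.i.d. standard Gaussian entries.
x = rng.standard_normal((n_pairs, D))
y = rng.standard_normal((n_pairs, D))

# Cosine similarity of each pair (row-wise).
cos = np.sum(x * y, axis=1) / (
    np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
)

rms_cos = np.sqrt(np.mean(cos**2))   # compare with 1 / sqrt(D)
mean_abs_cos = np.abs(cos).mean()    # typical magnitude of the similarity

print("1/sqrt(D)      :", 1 / np.sqrt(D))
print("RMS cosine     :", rms_cos)
print("mean |cosine|  :", mean_abs_cos)
```

Both statistics come out on the order of $10^{-2}$, which is the scale against which the correlations in Figure 1 are judged.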

Theorems & Definitions (13)

  • Definition 2.1
  • Definition 2.2
  • Theorem 3.1
  • Theorem 3.2
  • Conjecture 1
  • Conjecture 2
  • Proposition A.1
  • Theorem A.2
  • Theorem A.3
  • Corollary A.4
  • ...and 3 more