Bottleneck Structure in Learned Features: Low-Dimension vs Regularity Tradeoff

Arthur Jacot

TL;DR

This work analyzes how deep, $L_{2}$-regularized neural networks bias learned mappings toward low-dimensional representations via the Bottleneck rank (BN-rank), and it develops finite-depth corrections to the infinite-depth picture. Expanding the representation cost as $R(f;\Omega,L)=L\,R^{(0)}(f;\Omega)+R^{(1)}(f;\Omega)+\frac{1}{L}R^{(2)}(f;\Omega)+O(L^{-2})$, it introduces a regularity term $R^{(1)}$ that upper-bounds $2\log\left|Jf(x)\right|_{+}$ and is subadditive under composition and addition, and a second correction $R^{(2)}$ that controls convergence properties in linear and, partially, nonlinear settings. The author proves a bottleneck structure in large-depth networks: for balanced parameters with Jacobian rank $k$ at a given input, most layers have weight matrices whose top $k$ singular values are near $1$ while the remaining singular values shrink like $O(L^{-1/2})$. Provided the NTK scales as $O(L)$, this implies that the hidden representations of most layers are approximately $k$-dimensional. The paper also shows that rank-underestimating minima tend to be narrow due to NTK blow-up, offering an explanation for why gradient-based training tends to recover the BN-rank rather than underestimate it. A numerical symmetry-learning experiment illustrates how networks acquire low BN-rank representations by capturing task symmetries in a two-dimensional latent space.
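The expansion can be checked numerically in the linear case. The sketch below assumes (this is not stated in the summary above) that for a linear map $A$ the depth-$L$ representation cost reduces to the Schatten quasi-norm expression $L\sum_{i}s_{i}^{2/L}$ over the nonzero singular values $s_{i}$ of $A$, a standard result for $L_{2}$-regularized deep linear networks. Taylor-expanding $s_{i}^{2/L}=e^{(2/L)\log s_{i}}$ then gives $R^{(0)}=\mathrm{Rank}\,A$, $R^{(1)}=2\sum_{i}\log s_{i}=2\log\left|A\right|_{+}$ and $R^{(2)}=2\sum_{i}(\log s_{i})^{2}$. The function names and the tolerance `tol` are illustrative choices, not taken from the paper.

```python
import numpy as np

def schatten_cost(A, L, tol=1e-10):
    """Assumed linear-case representation cost: L * sum_i s_i^(2/L) over nonzero s_i."""
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > tol]  # drop numerically zero singular values
    return L * np.sum(s ** (2.0 / L))

def expansion_terms(A, tol=1e-10):
    """Return (R0, R1, R2) from the Taylor expansion of the assumed linear-case cost."""
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > tol]
    log_s = np.log(s)
    r0 = float(len(s))             # rank of A
    r1 = 2.0 * np.sum(log_s)       # 2 * log pseudo-determinant of A
    r2 = 2.0 * np.sum(log_s ** 2)  # second-order correction
    return r0, r1, r2

# Toy rank-2 target (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))

r0, r1, r2 = expansion_terms(A)
for L in (10, 100, 1000):
    exact = schatten_cost(A, L)
    approx = L * r0 + r1 + r2 / L
    # The gap should shrink roughly like 1 / L^2.
    print(L, exact, approx, abs(exact - approx))
```

In this toy setting the leading term counts the rank, while $R^{(1)}$ penalizes large nonzero singular values, which is one way to read the low-dimension versus regularity tradeoff in the title.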

Abstract

Previous work has shown that DNNs with large depth $L$ and $L_{2}$-regularization are biased towards learning low-dimensional representations of the inputs, which can be interpreted as minimizing a notion of rank $R^{(0)}(f)$ of the learned function $f$, conjectured to be the Bottleneck rank. We compute finite-depth corrections to this result, revealing a measure $R^{(1)}$ of regularity which bounds the pseudo-determinant of the Jacobian $\left|Jf(x)\right|_{+}$ and is subadditive under composition and addition. This formalizes a balance between learning low-dimensional representations and minimizing complexity/irregularity in the feature maps, allowing the network to learn the 'right' inner dimension. Finally, we prove the conjectured bottleneck structure in the learned features as $L\to\infty$: for large depths, almost all hidden representations are approximately $R^{(0)}(f)$-dimensional, and almost all weight matrices $W_{\ell}$ have $R^{(0)}(f)$ singular values close to 1 while the others are $O(L^{-\frac{1}{2}})$. Interestingly, the use of large learning rates is required to guarantee an order $O(L)$ NTK, which in turn guarantees infinite-depth convergence of the representations of almost all layers.
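A linear toy model gives some intuition for the singular-value picture. The sketch below is an illustration under stated assumptions, not the paper's construction: it factors a rank-$k$ matrix $A$ into $L$ balanced square layers, each with singular values $s_{i}^{1/L}$, so the $k$ nonzero ones approach 1 as $L$ grows. In this exact linear factorization the remaining singular values are exactly zero, so the $O(L^{-\frac{1}{2}})$ behaviour is not reproduced here. The helper names and `tol` are hypothetical.

```python
import numpy as np

def balanced_linear_factorization(A, L, tol=1e-10):
    """Return square matrices W_1..W_L (L >= 2) with W_L @ ... @ W_1 == A up to rounding.

    Each layer carries the L-th root of the nonzero singular values of A,
    so consecutive layers satisfy W_{l+1}^T W_{l+1} == W_l W_l^T (balancedness).
    """
    U, s, Vt = np.linalg.svd(A)
    # Zero out numerically tiny singular values before taking the L-th root.
    d = np.where(s > tol, s ** (1.0 / L), 0.0)
    D = np.diag(d)
    return [D @ Vt] + [D.copy() for _ in range(L - 2)] + [U @ D]

def product(weights):
    """Multiply the layers in network order: W_L @ ... @ W_1."""
    out = np.eye(weights[0].shape[1])
    for W in weights:
        out = W @ out
    return out

# Toy rank-2 target in 4 dimensions (hypothetical data).
A = np.diag([4.0, 0.25, 0.0, 0.0])
for L in (5, 20, 80):
    Ws = balanced_linear_factorization(A, L)
    middle = np.linalg.svd(Ws[L // 2], compute_uv=False)
    # Two singular values drift toward 1 as L grows; the rest stay at 0.
    print(L, np.round(middle, 3), np.allclose(product(Ws), A))
```

The squared parameter norm of this factorization is $L\sum_{i}s_{i}^{2/L}$, which grows like $kL$ for large $L$.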

Paper Structure (24 sections, 17 theorems, 91 equations, 2 figures)

Key Result

Theorem 3

For all inputs $x$ where $\mathrm{Rank}\,Jf(x)=R^{(0)}(f)$, $R^{(1)}(f)\geq 2\log\left|Jf(x)\right|_{+}$, furthermore:

Figures (2)

  • Figure 1: (a) Plot of the parameter norm at the end of training ($\lambda=0.001$) over a range of depths, colored according to the rank (number of singular values above 0.1) of the weight matrix $W_{L/2}$ in the middle of the network, and marked with a dot '.' or cross 'x' depending on whether the final train cost is below or above 0.1. The training data is synthetic and designed to have an optimal rank $k^{*}=2$. We see different ranges of depth where the network converges to different ranks, with larger depths leading to smaller ranks, until training fails and recovers the zero parameters for $L>25$. Within each range, the norm $\left\Vert \theta\right\Vert ^{2}$ is well approximated by an affine function with slope equal to the rank. (b) Plot of the singular values of $W_{\ell}$ throughout the networks for 4 trials. The bottleneck structure remains essentially the same throughout each range of depth, with only the middle low-rank part growing with the depth.
  • Figure 2: A depth $L=25$ network with a width of $200$ trained on the task described in Section \ref{subsec:Symmery-Learning} with a ridge $\lambda=0.0002$. (a) Singular values of the weight matrices of the network, showing two outliers in the bottleneck, which implies that the network has recovered the true rank of 2. (b) Hidden representation of the $6$-th layer projected to the first two dimensions. The images of GD paths do not cross in this space, showing that the dynamics on these two dimensions are self-consistent. (c) The distance $\left\Vert \alpha_{2}(x_{0})-\alpha_{2}(x)\right\Vert$ in the second hidden layer between the representation of a fixed point $x_{0}$ (at the white pixel) and that of another point $x$ on a plane orthogonal to the axis $w$ of rotation. All points on the same symmetry orbit are collapsed together, indicating that the network has learned the rotation symmetry.
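
The two measurements named in the Figure 1 caption, the rank as the number of singular values above 0.1 and the affine fit of $\left\Vert \theta\right\Vert ^{2}$ against depth, can be sketched as follows. This is a minimal illustration, not the paper's experiment: the norms are generated from an assumed linear rank-2 model with cost $L\sum_{i}s_{i}^{2/L}$, and the singular values, the depth range, and the helper names are hypothetical.

```python
import numpy as np

def numerical_rank(W, threshold=0.1):
    """Count singular values above the threshold (0.1, as in the Figure 1 caption)."""
    return int(np.sum(np.linalg.svd(W, compute_uv=False) > threshold))

def fit_affine(depths, norms):
    """Least-squares fit norms ~ slope * depth + intercept."""
    slope, intercept = np.polyfit(depths, norms, deg=1)
    return slope, intercept

# Hypothetical rank-2 target with singular values 3 and 2 (not the paper's data).
s = np.array([3.0, 2.0])
depths = np.arange(10, 26)

# Assumed linear-case parameter norm at each depth: L * sum_i s_i^(2/L).
norms = np.array([L * np.sum(s ** (2.0 / L)) for L in depths])
slope, intercept = fit_affine(depths, norms)

# A toy middle-layer matrix: two singular values near 1, two small ones.
W_mid = np.diag(np.append(s ** (1.0 / 25), [0.01, 0.0]))

# The fitted slope should sit close to the measured rank.
print(numerical_rank(W_mid), slope, intercept)
```

In this toy model the fitted slope is slightly below the rank because of the $\frac{1}{L}$ correction, and it approaches the rank as the depth range moves to larger $L$.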

Theorems & Definitions (30)

  • Remark 1
  • Remark 2
  • Theorem 3
  • Proposition 4
  • Proposition 5
  • Theorem 6
  • Theorem 7
  • Corollary 8
  • Proposition 9
  • Theorem 10
  • ...and 20 more