Low-Rank Learning by Design: the Role of Network Architecture and Activation Linearity in Gradient Rank Collapse

Bradley T. Baker, Barak A. Pearlmutter, Robyn Miller, Vince D. Calhoun, Sergey M. Plis

TL;DR

This work provides a theory-first investigation into gradient rank dynamics in deep neural networks, arguing that architectural choices such as bottlenecks, parameter tying, and activation linearity deterministically bound gradient rank throughout training, extending beyond terminal Neural Collapse. Using reverse-mode auto-differentiation as the analytical core, the authors derive explicit bounds for linear networks and extend them to recurrent and convolutional architectures, as well as Leaky-ReLU nonlinearities, accompanied by practical bounds for numerical rank via Hadamard-product inequalities. The contributions include exact gradient-rank bounds for linear networks, extensions to RNNs and CNNs with shared parameters, a bound on Leaky-ReLU singular-value contributions, and thorough empirical verification on synthetic and real-world datasets showing how design choices constrain learning dynamics. The findings have practical implications for deep learning engineering, offering design guidelines (e.g., bottlenecks, sequence length, activation linearity) and informing distributed training approaches that rely on low-rank gradient decompositions, while laying groundwork for future studies connecting gradient rank dynamics to Neural Collapse and training performance.

Abstract

Our understanding of learning dynamics of deep neural networks (DNNs) remains incomplete. Recent research has begun to uncover the mathematical principles underlying these networks, including the phenomenon of "Neural Collapse", where linear classifiers within DNNs converge to specific geometrical structures during late-stage training. However, the role of geometric constraints in learning extends beyond this terminal phase. For instance, gradients in fully-connected layers naturally develop a low-rank structure due to the accumulation of rank-one outer products over a training batch. Despite the attention given to methods that exploit this structure for memory saving or regularization, the emergence of low-rank learning as an inherent aspect of certain DNN architectures has been under-explored. In this paper, we conduct a comprehensive study of gradient rank in DNNs, examining how architectural choices and the structure of the data affect gradient rank bounds. Our theoretical analysis provides these bounds for training fully-connected, recurrent, and convolutional neural networks. We also demonstrate, both theoretically and empirically, how design choices like activation function linearity, bottleneck layer introduction, convolutional stride, and sequence truncation influence these bounds. Our findings not only contribute to the understanding of learning dynamics in DNNs, but also provide practical guidance for deep learning engineers to make informed design decisions.
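The low-rank structure mentioned in the abstract can be checked on a toy problem. For a fully-connected linear layer, the weight gradient over a batch of $N$ samples is a sum of $N$ rank-one outer products of an input activation and a backpropagated delta, so its rank is at most $N$, and a bottleneck of width $b$ further caps the rank of everything that passes through it. The sketch below is an illustration under stated assumptions, not the authors' code: the squared-error loss, the three-weight linear network, the random integer data, and the helper names (`matmul`, `transpose`, `rank`, `random_matrix`, `gradient_ranks`) are all hypothetical choices made for this example. It uses exact rational arithmetic from the standard library so that no numerical-rank tolerance has to be chosen.

```python
import random
from fractions import Fraction


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def rank(A):
    """Exact matrix rank via Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in A]
    n_rows, n_cols = len(M), len(M[0])
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, n_rows):
            f = M[i][c] / M[r][c]
            M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        r += 1
        if r == n_rows:
            break
    return r


def random_matrix(rows, cols, rng):
    """Small random integer matrix (hypothetical toy data and weights)."""
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


def gradient_ranks(batch, d_in, hidden, bottleneck, d_out, seed=0):
    """Ranks of the weight gradients of a 3-weight linear network.

    Assumed setup: Y_hat = X @ W1 @ W2 @ W3 with loss 0.5 * ||Y_hat - Y||^2,
    where rows of X and Y are samples and W2 maps into the bottleneck.
    """
    rng = random.Random(seed)
    X = random_matrix(batch, d_in, rng)
    Y = random_matrix(batch, d_out, rng)
    W1 = random_matrix(d_in, hidden, rng)
    W2 = random_matrix(hidden, bottleneck, rng)
    W3 = random_matrix(bottleneck, d_out, rng)

    # Forward pass: activations feeding each weight matrix.
    H1 = matmul(X, W1)
    H2 = matmul(H1, W2)
    Y_hat = matmul(H2, W3)

    # Backward pass: deltas (loss gradients w.r.t. each layer's output).
    D3 = [[p - t for p, t in zip(p_row, t_row)] for p_row, t_row in zip(Y_hat, Y)]
    D2 = matmul(D3, transpose(W3))
    D1 = matmul(D2, transpose(W2))

    # Each gradient is (input activations)^T @ deltas, i.e. a sum over the
    # batch of rank-one outer products.
    grads = {
        "W1": matmul(transpose(X), D1),
        "W2": matmul(transpose(H1), D2),
        "W3": matmul(transpose(H2), D3),
    }
    return {name: rank(G) for name, G in grads.items()}


if __name__ == "__main__":
    # Bottleneck of 2 with a batch of 8: every rank is at most 2.
    print(gradient_ranks(batch=8, d_in=6, hidden=6, bottleneck=2, d_out=6))
    # Batch of 1 with a bottleneck of 4: every rank is at most 1.
    print(gradient_ranks(batch=1, d_in=6, hidden=6, bottleneck=4, d_out=6))
```

In this toy setting every gradient rank is bounded by `min(batch, bottleneck)`, because each gradient factors through either the bottleneck activations or the deltas that were backpropagated through it. The sketch covers only the linear case; the nonlinear and numerical-rank bounds discussed in the paper are not reproduced here.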

Paper Structure (22 sections, 30 equations, 11 figures, 3 tables)

This paper contains 22 sections, 30 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: For a 3-layer Linear FC network, we plot how the mean rank of gradients, activations, and deltas changes with respect to the size of a neuron bottleneck in the middle layer. The x axis provides the name of the module, with depth increasing from right to left. In each panel, green, blue and orange bars represent the estimated rank of gradients, activations and deltas respectively. Black vertical lines on a bar indicate the standard error in the mean estimated rank across folds and model seeds.
  • Figure 2: For a 3-layer Elman-Cell RNN, we show how the mean rank of gradients, activations, and deltas changes with respect to the number of timepoints used in truncated BPTT. The x axis groups particular modules, with depth increasing from right to left. Each colored bar shows the mean estimated rank over multiple seeds and folds using a different sequence length for truncated BPTT.
  • Figure 3: A numerical exploration of the derived boundary beyond which a given eigenvalue $\sigma_k$ computed on a Leaky-ReLU activation will cease to contribute to the rank. For each experiment we generate 1000 $M \times M$ matrices with a known latent rank $k$, and we compute the singular value bound for contribution to the rank first using the post-activation singular values (blue curve) and then using the pre-activation singular values with equation 5 (orange curve). We also plot the error between the post- and pre-activation bounds (green curve). The bound from \ref{eq:leaky_relu_bound} is shown with a blue dotted line. For each experiment we show how the bound changes as a function of the linearity $\alpha$ of the Leaky-ReLU activation function.
  • Figure 4: For a 5-layer (6 weight) FC network with Leaky-ReLU activations, we show how the mean rank of gradients, activations, and deltas changes with respect to the negative slope $\alpha$ of the nonlinearity. Layer sizes are plotted on the x axis, with depth increasing from left to right. We enforce a bottleneck of 2 neurons in the central layer. For each module, we estimate the rank and provide a colorbar corresponding to the level of nonlinearity increasing in the range of [0,1].
  • Figure 5: Low-Dimensional Input
  • ...and 6 more figures