Understanding Deep Learning via Notions of Rank

Noam Razin

TL;DR

This work argues that deep learning generalization and expressiveness hinge on rank-based notions rather than purely norm-based complexity. It shows gradient-based optimization implicitly regularizes toward low rank across deep matrix, tensor, and hierarchical tensor factorizations, providing dynamical characterizations and rigorous results that norms alone cannot capture. The thesis then extends these ideas to graph neural networks via separation rank, showing how a partition’s walk index governs the modeled interactions and enabling practical tools like Walk Index Sparsification to preserve expressivity under edge removal. Collectively, the results offer a rank-centric theory of deep learning with concrete regularization strategies and graph-structure insights that can guide architecture design and data preprocessing for improved generalization and efficiency.

Abstract

Despite the extreme popularity of deep learning in science and industry, its formal understanding is limited. This thesis puts forth notions of rank as key for developing a theory of deep learning, focusing on the fundamental aspects of generalization and expressiveness. In particular, we establish that gradient-based training can induce an implicit regularization towards low rank for several neural network architectures, and demonstrate empirically that this phenomenon may facilitate an explanation of generalization over natural data (e.g., audio, images, and text). Then, we characterize the ability of graph neural networks to model interactions via a notion of rank, which is commonly used for quantifying entanglement in quantum physics. A central tool underlying these results is a connection between neural networks and tensor factorizations. Practical implications of our theory for designing explicit regularization schemes and data preprocessing algorithms are presented.

Paper Structure (157 sections, 85 theorems, 488 equations, 33 figures, 5 tables, 3 algorithms)


Introduction
Generalization via Implicit Rank Minimization
Implicit Regularization in Deep Learning May Not Be Explainable by Norms
Background and Overview
Deep Matrix Factorization
Implicit Regularization Can Drive All Norms to Infinity
A Simple Matrix Completion Problem
Decreasing Loss Increases Norms
Convergence to Zero Loss
Robustness to Perturbations
Experiments
Analyzed Settings
From Matrix to Tensor Factorization
Implicit Regularization in Tensor Factorization
Background and Overview
...and 142 more sections

Key Result

Proposition 1

For any norm or quasi-norm over matrices $\norm{\cdot}$ and any $\epsilon > 0$, there exists a bounded interval $I_{\norm{\cdot} , \epsilon} \subset {\mathbb R}$ such that if ${\mathbf W} \in {\mathcal{S}}$ is an $\epsilon$-minimizer of $\norm{\cdot}$ (i.e. $\norm{{\mathbf W}} \leq \inf_{{\mathbf W}

Figures (33)

Figure 1: Implicit regularization in matrix factorization can drive all norms (and quasi-norms) towards infinity. For the matrix completion problem defined in Section \ref{mf:sec:analysis:setting}, our analysis (Section \ref{mf:sec:analysis:norms_up}) implies that with a small learning rate and initialization close to the origin, when the product matrix (Equation \ref{mf:eq:prod_mat}) is initialized to have positive determinant, gradient descent on a matrix factorization leads the absolute value of the unobserved entry to increase (which in turn means norms and quasi-norms increase) as the loss decreases, i.e. as observations are fit. This is demonstrated in the plots above, which, for representative runs, show the absolute value of the unobserved entry as a function of the loss (Equation \ref{mf:eq:loss}), with iteration number encoded by color. Each plot corresponds to a different depth for the matrix factorization, and presents runs with varying configurations of learning rate and initialization (abbreviated as "lr" and "init", respectively). Both balanced (Equation \ref{mf:eq:balance}) and unbalanced (layer-wise independent) random initializations were evaluated (the former is marked by "(b)"). Independently for each depth, runs were carried out iteratively, with both the learning rate and the standard deviation for initialization decreased after each run, until the point where further reduction did not yield a noticeable change (presented runs are those from the last iterations of this process). Notice that depth, balancedness, and small learning rate and initialization all contribute to the examined effect (absolute value of the unobserved entry increasing as the loss decreases), with the transition from depth $2$ to $3$ or more being most significant. Notice also that all runs initially follow the same curve, differing from one another in the point at which they diverge from it (enter a phase where the examined effect is weaker). A complete investigation of these phenomena is left for future work. For further implementation details, and similar experiments with different matrix dimensions, as well as perturbed and repositioned observations, see Appendix \ref{mf:app:experiments}.
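The effect described in the caption of Figure 1 can be probed with a small numerical sketch. The matrix completion problem itself is defined in a section not reproduced here, so the 2-by-2 instance below (three observed entries, one unobserved), the scaled-identity initialization, and all names and hyperparameters (`run_deep_mf`, `init_scale`, `lr`, `steps`) are assumptions made for illustration, not the author's experimental setup.

```python
import numpy as np

def run_deep_mf(depth=3, init_scale=0.2, lr=0.01, steps=5000):
    """Gradient descent over a depth-`depth` factorization W = W_depth ... W_1 of a 2x2 matrix.

    Returns a list of (loss, |W[0, 0]|, det(W)) tuples, one per iteration.
    """
    # ASSUMPTION: hypothetical observations W[0,1] = 1, W[1,0] = 1, W[1,1] = 0;
    # the entry W[0,0] is unobserved and therefore absent from the loss.
    target = np.array([[0.0, 1.0], [1.0, 0.0]])
    mask = np.array([[0.0, 1.0], [1.0, 1.0]])
    # ASSUMPTION: scaled-identity layers, so the initial product matrix is close
    # to the origin and has positive determinant.
    layers = [init_scale * np.eye(2) for _ in range(depth)]  # layers[0] is W_1
    history = []
    for _ in range(steps):
        prod = np.eye(2)
        for layer in layers:
            prod = layer @ prod                   # builds W_depth ... W_1
        residual = mask * (prod - target)         # gradient of the loss w.r.t. the product
        loss = 0.5 * np.sum(residual ** 2)
        history.append((loss, abs(prod[0, 0]), np.linalg.det(prod)))
        grads = []
        for j in range(depth):
            left = np.eye(2)
            for layer in layers[j + 1:]:
                left = layer @ left               # product of the layers after layer j
            right = np.eye(2)
            for layer in layers[:j]:
                right = layer @ right             # product of the layers before layer j
            grads.append(left.T @ residual @ right.T)  # chain rule through the product
        layers = [w - lr * g for w, g in zip(layers, grads)]
    return history

history = run_deep_mf()
first_loss, first_abs_entry, _ = history[0]
last_loss, last_abs_entry, _ = history[-1]
print(f"loss: {first_loss:.4f} -> {last_loss:.6f}")
print(f"|unobserved entry|: {first_abs_entry:.4f} -> {last_abs_entry:.4f}")
```

Plotting the second element of each tuple against the first, for several values of `depth`, gives a rough analogue of the curves the caption describes. The recorded determinant lets one check whether the product matrix keeps a positive determinant along the run.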
Figure 2: Gradient descent over tensor factorization exhibits an implicit regularization towards low tensor rank. Plots above report results of tensor completion experiments, comparing: (i) minimization of loss (Equation \ref{mf:eq:loss_tensor}) via gradient descent over tensor factorization (Equation \ref{mf:eq:tf} with $R$ large enough for expressing any tensor) starting from (small) random initialization (method is abbreviated as "tf"); against (ii) a trivial baseline that matches observations while holding zeros in unobserved locations, equivalent to minimizing loss via gradient descent over a linear parameterization (i.e. directly over ${\mathcal{W}}$) starting from zero initialization (hence this method is referred to as "linear"). Each pair of plots corresponds to a randomly drawn low-rank ground truth tensor, from which multiple sets of observations varying in size were randomly chosen. The ground truth tensors corresponding to the left and right pairs both have rank $1$ (for results obtained with additional ground truth ranks see Figure \ref{mf:fig:experiment_tf_r3} in Appendix \ref{mf:app:experiments:further}), with sizes $8$-by-$8$-by-$8$ (order $3$) and $8$-by-$8$-by-$8$-by-$8$ (order $4$) respectively. The plots in each pair show reconstruction errors (Frobenius distance from ground truth) and ranks (numerically estimated) of final solutions as a function of the number of observations in the task, with error bars spanning the interquartile range ($25$th to $75$th percentiles) over multiple trials (differing in random seed for initialization), and markers showing the median. For gradient descent over tensor factorization, we employed an adaptive learning rate scheme to reduce run times (see Appendix \ref{mf:app:experiments:details} for details), and iteratively ran with decreasing standard deviation for initialization, until the point at which further reduction did not yield a noticeable change (presented results are those from the last iterations of this process, with the corresponding standard deviations annotated by "init"). Notice that gradient descent over tensor factorization indeed exhibits an implicit tendency towards low rank (leading to accurate reconstruction of low-rank ground truth tensors), and that this tendency is stronger with smaller initialization. For further details and experiments see Appendix \ref{mf:app:experiments}.
Figure 3: Tensor factorization corresponds to a non-linear convolutional neural network (with polynomial non-linearity), analogously to how matrix factorization corresponds to a linear neural network. The input to the network is a tuple $( i_1, \ldots , i_N ) \in \{ 1 , \ldots , D_1 \} \times \cdots \times \{ 1 , \ldots , D_N \}$, represented via one-hot vectors $( {\mathbf x}_1, \ldots , {\mathbf x}_N ) \in {\mathbb R}^{D_1} \times \cdots \times {\mathbb R}^{D_N}$ (the illustration assumes $D_1 = \cdots = D_N = D$ to avoid clutter). These vectors are processed by a hidden layer comprising: (i) a locally connected linear operator with $R$ channels, the $r$'th one computing inner products against filters $( {\mathbf w}^{1}_r , \ldots , {\mathbf w}^{N}_r ) \in {\mathbb R}^{D_1} \times \cdots \times {\mathbb R}^{D_N}$ (this operator is referred to as "$1 {\times} 1$ conv", appealing to the case of weight sharing, i.e. ${\mathbf w}^{1}_r = \cdots = {\mathbf w}^{N}_r$); followed by (ii) global pooling computing products of all activations in each channel (which induces the polynomial non-linearity). The result of the hidden layer is then reduced through summation to a scalar, the output of the network. Overall, given input tuple $( i_1 , \ldots , i_N )$, the network outputs $( {\mathcal{W}} )_{i_1 , \ldots , i_N}$, where ${\mathcal{W}} \in {\mathbb R}^{D_1 \times \cdots \times D_N}$ is given by the tensor factorization in Equation \ref{mf:eq:tf}. Notice that the number of terms ($R$) and the tunable parameters ($\{ {\mathbf w}^{n}_r \}_{r , n}$) in the factorization respectively correspond to the width and the learnable filters of the network. Our tensor factorization (Equation \ref{mf:eq:tf}) was derived as an extension of a shallow (depth $2$) matrix factorization, and accordingly, the convolutional neural network it corresponds to is shallow (has a single hidden layer). Endowing the factorization with hierarchical structures would render it equivalent to a deep convolutional neural network (see cohen2016expressive for details). We will investigate the implicit regularization of these models in Chapter \ref{chap:imp_reg_htf}.
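The correspondence stated in the caption of Figure 3 can be checked numerically. The sketch below builds the tensor as a sum of $R$ rank-one terms and, separately, runs the described forward pass (inner products per channel, product pooling, summation) on one-hot inputs. The function names, the storage of the filters as one `(R, D_n)` array per mode, and the use of 0-based indices are assumptions of this illustration.

```python
import numpy as np

def tensor_from_factors(factors):
    """Sum of R rank-one terms; factors[n] has shape (R, D_n) and row r holds w^n_r."""
    num_terms = factors[0].shape[0]
    tensor = np.zeros(tuple(f.shape[1] for f in factors))
    for r in range(num_terms):
        component = factors[0][r]
        for f in factors[1:]:
            component = np.multiply.outer(component, f[r])  # outer product across modes
        tensor += component
    return tensor

def network_output(factors, index):
    """Forward pass of the shallow network on the one-hot encoding of `index` (0-based)."""
    activations = []
    for f, i in zip(factors, index):
        x = np.zeros(f.shape[1])
        x[i] = 1.0                      # one-hot input vector for this mode
        activations.append(f @ x)       # "1x1 conv": inner products against R filters
    pooled = np.prod(np.stack(activations), axis=0)  # product pooling within each channel
    return float(pooled.sum())          # summation over channels gives the scalar output

rng = np.random.default_rng(0)
dims, width = (3, 4, 5), 6
factors = [rng.standard_normal((width, d)) for d in dims]
W = tensor_from_factors(factors)
print(np.isclose(network_output(factors, (2, 1, 4)), W[2, 1, 4]))
```

Because each input is one-hot, the inner product in channel $r$ at mode $n$ simply selects one coordinate of ${\mathbf w}^{n}_r$, which is why the network output coincides with a single tensor entry.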
Figure 4: Prediction tasks over discrete variables can be viewed as tensor completion problems. Consider the task of learning a predictor from domain ${\mathcal{X}} = \{ 1 , \ldots , D_1 \} \times \cdots \times \{ 1 , \ldots , D_N \}$ to range ${\mathcal{Y}} = {\mathbb R}$ (the figure assumes $N = 3$ and $D_1 = \cdots = D_N = 5$ for the sake of illustration). Each input sample is associated with a location in an order $N$ tensor with mode (axis) dimensions $D_1, \ldots, D_N$, where the value of a variable (depicted as a shade of gray) determines the index of the corresponding mode (marked by "A", "B" or "C"). The associated location stores the label of the sample. Under this viewpoint, training samples are observed entries, drawn according to an unknown distribution from a ground truth tensor. Learning a predictor amounts to completing the unobserved entries, with test error measured by (weighted) average reconstruction error. In many standard prediction tasks (e.g. image recognition), only a small subset of the input domain has non-negligible probability. From the tensor completion perspective this means that observed entries reside in a restricted part of the tensor, and reconstruction error is weighted accordingly (entries outside the support of the distribution are neglected).
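The viewpoint in the caption of Figure 4 amounts to a simple bookkeeping step, sketched below. Training samples are written into a tensor at the location given by their input tuple, and a mask records which entries were observed. The helper names, the 0-based indexing, and the choice of squared error inside the weighted average are assumptions of this sketch; the caption does not fix a particular error measure.

```python
import numpy as np

def samples_to_tensor(samples, dims):
    """Place each label y at the tensor location given by its input tuple x (0-based)."""
    observed = np.zeros(dims)
    mask = np.zeros(dims, dtype=bool)
    for x, y in samples:
        observed[x] = y     # x is a tuple of N indices, one per mode
        mask[x] = True
    return observed, mask

def weighted_reconstruction_error(prediction, truth, weights):
    """Weighted average of squared entry-wise errors (ASSUMPTION: squared error)."""
    return float(np.sum(weights * (prediction - truth) ** 2) / np.sum(weights))

# Two hypothetical training samples over a 5-by-5-by-5 input domain.
samples = [((0, 1, 2), 1.5), ((4, 4, 4), -2.0)]
observed, mask = samples_to_tensor(samples, (5, 5, 5))
print(int(mask.sum()), observed[0, 1, 2], observed[4, 4, 4])
```

Setting `weights` to zero outside the support of the input distribution reproduces the caption's remark that entries outside the support are neglected.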
Figure 5: Dynamics of gradient descent over tensor factorization: incremental learning of components yields low tensor rank solutions. Presented plots correspond to the task of completing a (tensor) rank $5$ ground truth tensor of size $10$-by-$10$-by-$10$-by-$10$ (order $4$) based on $2000$ observed entries chosen uniformly at random without repetition (smaller sample sizes led to solutions with tensor rank lower than that of the ground truth tensor). In each experiment, the $\ell_2$ loss (more precisely, Equation \ref{tf:eq:tc_loss} with $\ell ( z ) := z^2$) was minimized via gradient descent over a tensor factorization with $R = 1000$ components (large enough to express any tensor), starting from (small) random initialization. The first (left) three plots show (Frobenius) norms of the ten largest components under three standard deviations for initialization: $0.05, 0.01,$ and $0.005$. Further reduction of initialization scale yielded no noticeable change. The rightmost plot compares reconstruction errors (Frobenius distance from ground truth) from the three runs. To facilitate more efficient experimentation, we employed an adaptive learning rate scheme (see Appendix \ref{tf:app:experiments:details} for details). Notice that, in accordance with the theoretical analysis of Section \ref{tf:sec:dynamic}, component norms move slower when small and faster when large, creating an incremental process in which components are learned one after the other. This effect is enhanced as initialization scale is decreased, producing low tensor rank solutions that accurately reconstruct the low (tensor) rank ground truth tensor. In particular, even though the factorization consists of $1000$ components, when initialization is sufficiently small, only five (the tensor rank of the ground truth tensor) substantially depart from zero. Appendix \ref{tf:app:experiments} provides further implementation details, as well as similar experiments with: (i) Huber loss (see Equation \ref{tf:eq:huber_loss}) instead of $\ell_2$ loss; (ii) ground truth tensors of different orders and (tensor) ranks; and (iii) tensor sensing (see Appendix \ref{tf:app:sensing}).
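A much smaller version of the experiment described in the caption of Figure 5 can be sketched as follows. It runs plain gradient descent over an order-3 tensor factorization on a tensor completion task and records the Frobenius norm of every component at each iteration. Everything about the configuration is an assumption chosen to keep the run short: the rank-1 ground truth of size 4-by-4-by-4, its rescaling to Frobenius norm 4, the constant learning rate (the caption mentions an adaptive scheme, which is not reproduced here), the unnormalized squared loss, and all function and parameter names.

```python
import numpy as np

def cp_tensor(factors):
    """Order-3 tensor factorization; each factor has shape (R, D), row r holds one vector."""
    a, b, c = factors
    return np.einsum('ri,rj,rk->ijk', a, b, c)

def component_norms(factors):
    """Frobenius norm of each rank-one component equals the product of its vector norms."""
    return np.prod([np.linalg.norm(f, axis=1) for f in factors], axis=0)

def run_tensor_completion(seed=0, dim=4, num_components=10, init_std=0.1,
                          lr=0.01, steps=5000, num_observed=40):
    rng = np.random.default_rng(seed)
    # ASSUMPTION: rank-1 ground truth, rescaled to a fixed Frobenius norm of 4.
    truth = cp_tensor([rng.standard_normal((1, dim)) for _ in range(3)])
    truth *= 4.0 / np.linalg.norm(truth)
    # Observed entries are chosen uniformly at random without repetition.
    mask = np.zeros(dim ** 3, dtype=bool)
    mask[rng.choice(dim ** 3, size=num_observed, replace=False)] = True
    mask = mask.reshape(dim, dim, dim)
    factors = [init_std * rng.standard_normal((num_components, dim)) for _ in range(3)]
    losses, norms = [], []
    for _ in range(steps):
        a, b, c = factors
        residual = mask * (cp_tensor(factors) - truth)   # zero on unobserved entries
        losses.append(0.5 * np.sum(residual ** 2))
        norms.append(component_norms(factors))
        # Gradients of the squared loss with respect to each factor matrix.
        grad_a = np.einsum('ijk,rj,rk->ri', residual, b, c)
        grad_b = np.einsum('ijk,ri,rk->rj', residual, a, c)
        grad_c = np.einsum('ijk,ri,rj->rk', residual, a, b)
        factors = [a - lr * grad_a, b - lr * grad_b, c - lr * grad_c]
    return np.array(losses), np.array(norms)

losses, norms = run_tensor_completion()
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
print("final component norms (sorted):", np.round(np.sort(norms[-1])[::-1], 3))
```

Plotting the columns of `norms` over iterations, and repeating the run with smaller `init_std`, is one way to look for the incremental pattern the caption reports. This toy configuration is not guaranteed to reproduce it.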
...and 28 more figures

Theorems & Definitions (186)

Conjecture 1: from gunasekar2017implicit, informally stated
Conjecture 2: based on arora2019implicit, informally stated
Proposition 1
proof: Proof sketch (proof in Appendix \ref{mf:app:proofs:sol_set_norms})
Definition 1: from roy2007effective
Definition 2
Proposition 2
proof: Proof sketch (proof in Appendix \ref{mf:app:proofs:sol_set_rank})
Theorem 1
proof: Proof sketch (proof in Appendix \ref{mf:app:proofs:norms_up_finite})
...and 176 more
