Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination

Peng Wang; Xiao Li; Can Yaras; Zhihui Zhu; Laura Balzano; Wei Hu; Qing Qu

TL;DR

The paper addresses how deep networks learn hierarchical representations by analyzing intermediate features in deep linear networks (DLNs). It introduces two metrics, $C_l$ for within-class compression and $D_l$ for between-class discrimination, and proves that, when the input data are nearly orthogonal and the trained weights are minimum-norm, balanced, and approximately low-rank, $C_l$ decays geometrically while $D_l$ grows linearly with the number of layers the data have passed through. The main result, Theorem 4, holds for DLNs and exhibits a neural-collapse-like behavior without relying on unconstrained features, with empirical evidence extending to nonlinear networks and transfer-learning implications via projection heads. The findings provide practical guidance for network architecture design, interpretation, and transfer learning, and they open avenues for extending the theory to nonlinear networks and broader data regimes. Overall, the work links depth to principled, quantitative shifts in feature geometry across layers, enriching our theoretical understanding of deep representation learning.

Abstract

Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic the roles of deep layers in nonlinear networks for feature learning, we explore how deep linear networks transform input data into output by investigating the output (i.e., features) of each layer after training in the context of multi-class classification problems. Toward this goal, we first define metrics to measure within-class compression and between-class discrimination of intermediate features, respectively. Through theoretical analysis of these two metrics, we show that the evolution of features follows a simple and quantitative pattern from shallow to deep layers when the input data is nearly orthogonal and the network weights are minimum-norm, balanced, and approximately low-rank: each layer of the linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to the number of layers that data have passed through. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks. Empirically, our extensive experiments not only validate our theoretical results numerically but also reveal a similar pattern in deep nonlinear networks, which aligns well with recent empirical studies. Moreover, we demonstrate the practical implications of our results in transfer learning. Our code is available at https://github.com/Heimine/PNC_DLN.
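
The two metrics named in the abstract can be made concrete with a small numerical sketch. The paper's own definition is not reproduced on this page, so the formulas here are assumptions: the compression measure is taken to be the ratio of the within-class to the between-class covariance trace, and the discrimination measure is taken to be one minus the largest cosine similarity between two class means. The function names `compression_discrimination` and `toy_features`, and the synthetic data, are hypothetical and serve only as illustration.

```python
import numpy as np

def compression_discrimination(Z, labels):
    """Z: (d, N) features, one column per sample; labels: (N,) integer class ids.

    Assumed definitions (not taken from the page above):
      C = trace(within-class covariance) / trace(between-class covariance)
      D = 1 - max over distinct class pairs of the cosine similarity of class means
    """
    classes = np.unique(labels)
    global_mean = Z.mean(axis=1, keepdims=True)
    # Class means stacked as columns, shape (d, K).
    means = np.stack([Z[:, labels == k].mean(axis=1) for k in classes], axis=1)

    # Within-class scatter: squared distance of each sample to its class mean.
    within = 0.0
    for j, k in enumerate(classes):
        centered = Z[:, labels == k] - means[:, [j]]
        within += np.sum(centered ** 2)
    within /= Z.shape[1]

    # Between-class scatter: squared distance of class means to the global mean.
    between = np.sum((means - global_mean) ** 2) / len(classes)

    # Largest off-diagonal cosine similarity between class means.
    unit = means / np.linalg.norm(means, axis=0, keepdims=True)
    cos = unit.T @ unit
    np.fill_diagonal(cos, -np.inf)
    return within / between, 1.0 - cos.max()

def toy_features(noise, rng, K=3, d=10, n=50):
    """Synthetic features: K orthogonal class centers plus Gaussian noise."""
    labels = np.repeat(np.arange(K), n)
    centers = 3.0 * np.eye(d)[:, :K]
    Z = centers[:, labels] + noise * rng.standard_normal((d, K * n))
    return Z, labels

rng = np.random.default_rng(0)
Z_loose, y_loose = toy_features(1.0, rng)   # spread-out classes
Z_tight, y_tight = toy_features(0.1, rng)   # strongly compressed classes

C_loose, D_loose = compression_discrimination(Z_loose, y_loose)
C_tight, D_tight = compression_discrimination(Z_tight, y_tight)
print(f"loose: C={C_loose:.3f}, D={D_loose:.3f}")
print(f"tight: C={C_tight:.3f}, D={D_tight:.3f}")
```

Applied to the output of each layer of a trained network, the same function would give one $(C_l, D_l)$ pair per layer, which is the kind of curve the abstract describes.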

Paper Structure (70 sections, 14 theorems, 153 equations, 17 figures)

Introduction
Empirical results on feature expansion and compression.
Empirical results on feature compression and discrimination.
Why study DLNs? Linear layers mimic deep layers in nonlinear networks for feature learning.
The role of depth in DLNs: improving generalization, feature compression, and training speed.
Our Contributions
Significance of our results.
Differences and connections to the existing literature.
Notation and Paper Organization
Notation.
Organization.
Preliminaries
Problem Setup
Multi-class classification problem.
DLNs for classification problems.
...and 55 more sections

Key Result

Theorem 4

Consider a $K$-class classification problem on the training data $(\bm X,\bm Y) \in \mathbb R^{d \times N} \times \mathbb{R}^{K\times N}$, where the matrix $\bm X$ satisfies Assumption AS:1 with parameter $\theta$. Suppose that we train an $L$-layer DLN with weights $\bm \Theta = \left\{ \bm W_l \right\}$ [...]

(i) Progressive within-class feature compression: For $C_{l}$ in Definition 1, it holds that [...]

Figures (17)

Figure 1: Illustration of numerical rank and training accuracy across layers. We train two networks with different architectures on the CIFAR-10 dataset: (a) An 8-layer multilayer perceptron (MLP) network with ReLU activation, (b) A hybrid network consisting of a 3-layer MLP with ReLU activation followed by a 5-layer linear network. For each figure, we plot the numerical rank of the features of each layer and the training accuracy obtained by applying linear probing to the output of each layer, both against the number of layers. The green shading indicates that the features at these layers are approximately linearly separable, as evidenced by the near-perfect accuracy achieved by a linear classifier. The definition of numerical rank and additional experimental details are deferred to \ref{subsubsec:resemb_1}.
Figure 2: Visualization of feature compression & discrimination from shallow to deep layers. We consider the same setup as in Figure 1. For each network, we visualize the outputs of layers 1, 2, 4, and 6 on the CIFAR-10 dataset using a 2-dimensional UMAP plot (McInnes et al., 2018). Additional experimental details are deferred to \ref{subsubsec:resemb_1}.
Figure 3: Depth of DLNs leads to better generalization performance. We train hybrid networks consisting of a 2-layer MLP with ReLU activation followed by $(L-2)$ linear layers on the FashionMNIST and CIFAR-10 datasets, respectively. As a reference, we also train nonlinear networks comprised exclusively of MLP layers. We plot the test accuracy, averaged over $5$ different runs, against the number of layers. It is observed that adding either linear layers or MLP layers can improve generalization performance. More experimental details are deferred to \ref{subsubsec:linear_gene}.
Figure 4: Progressive feature compression on DLNs trained with default initialization and real datasets. Using the DLNs trained in \ref{fig:why_dln}, we plot the within-class compression metrics $C_l$ (see Definition 1) against the layer indices. It is observed that progressive linear decay still (approximately) happens without the orthogonal initialization and datasets described in Assumption \ref{AS:1}.
Figure 5: Progressive feature compression and discrimination on both linear and nonlinear networks. We plot the feature compression and discrimination metrics defined in \ref{eq:nc1} for $l=1,\dots,L-1$ on both the linear network (top row) and nonlinear network (bottom row). We train both networks using a nearly orthogonal dataset as described in Assumption \ref{AS:1}, initializing the network weights to satisfy \ref{eq:init}, with an initialization scale of $\xi=0.3$. We train both networks via gradient descent until convergence. In each figure, the $x$-axis denotes the number of layers from shallow to deep, with layer-0 denoting the inputs. In the left figures, the $y$-axis denotes the compression measure $C_l$ on a logarithmic scale; in the right figures, the $y$-axis denotes the discrimination measure $D_l$. More experimental details can be found in \ref{subsec:exp-thm}.
...and 12 more figures
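
The Figure 1 caption tracks the numerical rank of each layer's features but defers its definition. As a rough sketch only, one common convention counts the singular values that exceed a relative threshold; this convention, the function name `numerical_rank`, and the tolerance `rel_tol` are assumptions here and may differ from the definition used in the paper.

```python
import numpy as np

def numerical_rank(Z, rel_tol=1e-3):
    """Count singular values of Z above rel_tol times the largest one.

    This thresholding rule is an assumed convention, not the paper's definition.
    """
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, sorted descending
    if s.size == 0 or s[0] == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
# A 20 x 100 feature matrix that is exactly rank 3, plus very small noise.
low_rank = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 100))
noisy = low_rank + 1e-6 * rng.standard_normal((20, 100))

print(numerical_rank(noisy))         # counts only the dominant directions
print(np.linalg.matrix_rank(noisy))  # exact rank is inflated by the noise
```

The relative threshold is what makes the measure useful for learned features, which are rarely exactly low-rank: tiny singular values caused by noise are ignored instead of being counted as extra dimensions.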

Theorems & Definitions (15)

Definition 1: Intermediate layer-wise feature compression and discrimination
Theorem 4
Proposition 5
Lemma 6
Lemma 7
Lemma 8
Lemma 9
Lemma 10
Theorem 11
Lemma 12
...and 5 more
