How Does Overparameterization Affect Features?

Ahmet Cagri Duzgun, Samy Jelassi, Yuanzhi Li

TL;DR

This work investigates why overparameterized networks learn superior representations by directly comparing an overparameterized base model $M_1$ with width-scaled underparameterized variants $M_\alpha$ under matched feature counts. It introduces two metrics, feature span error and feature performance, to quantify the expressivity and downstream accuracy of the learned features, and applies ridge regression and linear probing to features from both kinds of model. Across vision and NLP tasks, the authors show that the feature space of $M_1$ is not spanned by concatenations of underparameterized networks and that feature residuals play a key role in achieving higher performance. The findings challenge the view that merely increasing the number of features is sufficient for richer representations and provide a mechanistic explanation for the advantage of overparameterization, with implications for model design and theory. The work also introduces a toy mechanism illustrating how an overparameterized network can learn features that concatenations of low-width networks struggle to capture, guiding future theoretical and empirical research on representation learning with wide networks.
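
As a rough illustration of the ridge-regression step described above, the following sketch fits one set of features linearly from another and reports the held-out mean squared error, together with the part of the target features that the source cannot explain. The function names, the unnormalized MSE, the regularization strength `lam`, and the synthetic data are all assumptions made for this example; the paper's exact definitions of feature span error and feature residuals may differ.

```python
import numpy as np

def ridge_fit(X, Y, lam=1e-3):
    """Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def feature_span_error(src_train, tgt_train, src_test, tgt_test, lam=1e-3):
    """Held-out MSE of predicting target features linearly from source features.

    Also returns the held-out residuals, i.e. the part of the target
    features that the linear map from the source does not capture.
    (Assumed form; not necessarily the paper's normalization.)
    """
    W = ridge_fit(src_train, tgt_train, lam)
    residuals = tgt_test - src_test @ W
    return float(np.mean(residuals ** 2)), residuals

# Synthetic stand-ins for network features (hypothetical data).
rng = np.random.default_rng(0)
n_train, n_test, d_src = 300, 100, 8
src = rng.normal(size=(n_train + n_test, d_src))
tgt_in_span = src @ rng.normal(size=(d_src, 4))       # linear in the source
tgt_outside = rng.normal(size=(n_train + n_test, 4))  # independent of the source

def split(F):
    """Split rows into train and held-out parts."""
    return F[:n_train], F[n_train:]

src_tr, src_te = split(src)
in_tr, in_te = split(tgt_in_span)
out_tr, out_te = split(tgt_outside)

fse_in, _ = feature_span_error(src_tr, in_tr, src_te, in_te)
fse_out, resid_out = feature_span_error(src_tr, out_tr, src_te, out_te)
print("error when target lies in the span of the source:", fse_in)
print("error when target is unrelated to the source:", fse_out)
```

In this toy setup, a target that lies in the span of the source is recovered almost exactly, while an unrelated target leaves large residuals. The paper's comparison between $M_1$ and concatenated low-width features asks the same kind of question on real network features.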

Abstract

Overparameterization, the condition where models have more parameters than necessary to fit their training loss, is a crucial factor for the success of deep learning. However, the characteristics of the features learned by overparameterized networks are not well understood. In this work, we explore this question by comparing models with the same architecture but different widths. We first examine the expressivity of the features of these models, and show that the feature space of overparameterized networks cannot be spanned by concatenating many underparameterized features, and vice versa. This reveals that both overparameterized and underparameterized networks acquire some distinctive features. We then evaluate the performance of these models, and find that overparameterized networks outperform underparameterized networks, even when many of the latter are concatenated. We corroborate these findings using VGG-16 and ResNet18 on CIFAR-10 and a Transformer on the MNLI classification dataset. Finally, we propose a toy setting to explain how overparameterized networks can learn some important features that the underparameterized networks cannot learn.
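
To make the performance comparison concrete, here is a minimal sketch of evaluating concatenated features with a linear probe. It is an illustration under stated assumptions, not the authors' pipeline: the probe is a ridge-regularized least-squares classifier on one-hot labels (a stand-in for whatever linear classifier the paper trains), and the two one-dimensional "low-width" feature sets and the labels are synthetic. All function names are hypothetical.

```python
import numpy as np

def concat_features(feature_list):
    """Concatenate several feature matrices along the feature axis."""
    return np.concatenate(feature_list, axis=1)

def linear_probe_accuracy(f_train, y_train, f_test, y_test, lam=1e-3):
    """Fit a linear classifier on frozen features and return test accuracy.

    Assumption: least squares on one-hot labels with a bias column,
    used here as a simple stand-in for a trained linear probe.
    """
    n_classes = int(y_train.max()) + 1
    onehot = np.eye(n_classes)[y_train]
    X_tr = np.hstack([f_train, np.ones((f_train.shape[0], 1))])
    X_te = np.hstack([f_test, np.ones((f_test.shape[0], 1))])
    W = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]),
                        X_tr.T @ onehot)
    pred = np.argmax(X_te @ W, axis=1)
    return float(np.mean(pred == y_test))

# Synthetic example: each "narrow model" exposes only one coordinate,
# while the label depends on both coordinates.
rng = np.random.default_rng(0)
n_train, n_test = 1000, 500
x = rng.normal(size=(n_train + n_test, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

feats_a, feats_b = x[:, :1], x[:, 1:]          # two hypothetical narrow feature sets
feats_cat = concat_features([feats_a, feats_b])

acc_single = linear_probe_accuracy(feats_a[:n_train], y[:n_train],
                                   feats_a[n_train:], y[n_train:])
acc_concat = linear_probe_accuracy(feats_cat[:n_train], y[:n_train],
                                   feats_cat[n_train:], y[n_train:])
print("single narrow feature set:", acc_single)
print("concatenated feature sets:", acc_concat)
```

In this toy case concatenation helps because the two feature sets are complementary. The abstract's claim is that, on real networks, even such concatenations of low-width features do not reach the probe accuracy of a single overparameterized model's features.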

Paper Structure (34 sections, 9 equations, 14 figures, 5 tables)


Figures (14)

  • Figure 1: FSG with respect to overparameterized features (after training). The three panels display FSE($\mathcal{S}_{\alpha}^{(U^*)}\rightarrow \mathcal{M}_1$) (blue line) and FSE($\mathcal{M}_1\rightarrow \mathcal{M}_1$) (red line) in the Transformer, ResNet and VGG settings, respectively. While mildly low-width networks ($\alpha=1/2$) can fit the features of a trained overparameterized network well, much narrower models ($\alpha=1/8$ or $\alpha=1/16$) fit them significantly worse.
  • Figure 2: FSG with respect to the features of concatenated low-width networks (after training). The three panels display FSE($\mathcal{M}_1\rightarrow \mathcal{S}_{\alpha}^{(\bar{U})}$) (blue line) and FSE($\mathcal{S}_{\alpha}^{(\bar{U})}\rightarrow \mathcal{M}_{\alpha}$) (red line) in the Transformer, ResNet and VGG settings, respectively. As $\alpha$ decreases, the models have lower width and $\mathcal{M}_1$ increasingly struggles to capture $\mathcal{M}_{\alpha}$.
  • Figure 3: Feature performance. The three panels (Transformer, ResNet and VGG settings) display the feature performance obtained by concatenated low-width networks (blue line) and by overparameterized networks (red line). As $\alpha$ decreases, low-width networks fail to match the test accuracy of a single overparameterized network.
  • Figure 4: Contribution of feature residuals to test accuracy. The first pair of panels (Transformer and ResNet settings) compares the concatenated low-width network $\mathcal{S}_{\alpha}^{(U^*)}$ (red line) with the same model to which we append the feature residuals $R(\mathcal{S}_{\alpha}^{(U^*)}\rightarrow \mathcal{M}_1)$, abbreviated $R_{\alpha\rightarrow 1}^{(U^*)}$ (blue line). These plots show that as $\alpha$ decreases, the test accuracy gains brought by the residuals increase. The second pair of panels shows that adding the residuals $R(\mathcal{M}_{1}\rightarrow\mathcal{S}_{\alpha}^{(U^*)})$, abbreviated $R_{1\rightarrow \alpha}^{(U^*)}$, either does not increase or lowers the performance of $\mathcal{M}_1$ (a result of adding redundant features).
  • Figure 5: The first two panels display the total number of neurons activated by $\bm{v}_1$ (addition signal vector) and $\bm{v}_3$ (multiplication signal vector) throughout the training process. We display these curves for overparameterized ($\alpha=1$) and low-width concatenated ($\alpha\in\{1/300,1/120,1/3,1/2\}$) networks. Both kinds of model learn the addition vectors, but the low-width ones fail to learn $\bm{v}_3$, since the curves for $\alpha\in\{1/300,1/200\}$ quickly collapse to 0. The third panel displays the evolution of the correlation between an overparameterized model's neurons and the signal vectors. The correlations gradually increase to $1$, meaning that $M_1$ has learned the signal vectors.
  • ...and 9 more figures

Theorems & Definitions (6)

  • Definition 2.1: Features
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4: Concatenation of features
  • Definition 2.5: Feature performance
  • Definition 2.6: Feature residuals