These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining

Xingyu Alice Yang, Jianyu Zhang, Léon Bottou

TL;DR

The paper investigates why pretraining on a broad mixture of tasks does not guarantee parity with task-specific training, identifying a fundamental information-saturation bottleneck caused by sparsity bias in deep networks. It analyzes transfer via linear probing and neural tangent kernel features, provides a theoretical counterexample, and surveys empirical evidence of feature loss across spurious and domain-specific settings. To mitigate the bottleneck, the authors propose time-concatenation ensembles that create richer representations within a fixed compute budget, achieving measurable gains in transfer to unseen distributions while maintaining similar performance on in-distribution data. The results suggest that richer, more diverse representations, obtained via ensembles or targeted pretraining, can substantially boost transfer learning, challenging the notion that ever-larger single-model foundations alone deliver robust generalization.

Abstract

Transfer learning is widely used to adapt large pretrained models to new tasks with only a small amount of new data. However, a challenge persists: the features from the original task often do not fully cover what is needed for unseen data, especially when the relatedness of tasks is not clear. Since deep learning models tend to learn very sparse representations, they retain only the minimal features required for the initial training while discarding potentially useful ones for downstream transfer. A theoretical framework developed in this work demonstrates that such pretraining captures inconsistent aspects of the data distribution, thereby inducing transfer bias. To address this limitation, we propose an inexpensive ensembling strategy that aggregates multiple models to generate richer feature representations. On ResNet, this approach yields a $9\%$ improvement in transfer accuracy without incurring extra pretraining cost. We also present empirical evidence from a range of deep learning studies, confirming that the phenomenon is pervasive across modern deep learning architectures. These results suggest that relying solely on large pretrained networks is not always the most effective way to improve model generalization. Instead, fostering richer, more diverse representations (e.g., through model ensembles) can substantially enhance transfer learning performance.

Paper Structure

This paper contains 32 sections, 1 theorem, 11 equations, 5 figures, 3 tables.

Key Result

Proposition 2.1

If $\mathrm{Cor}_{P^{[i]}}[\varphi(X),Y] \neq 0$ for some component $P^{[i]}$, then $\mathrm{Cor}_{P^{[\mathrm{mix}]}}[\varphi(X),Y] \neq 0$ for almost all mixtures $P^{[\mathrm{mix}]}$: only a measure-zero set of mixture proportions can cancel out the correlation.
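The reasoning behind the proposition can be illustrated numerically. The sketch below is not from the paper: it uses two hypothetical components described only by their moments $E[\varphi]$, $E[Y]$, and $E[\varphi Y]$ (the values are made up for illustration), and it tracks the covariance rather than the correlation, which has the same zeros whenever both variances are positive. Under a mixture with weights $\lambda$, these moments average linearly, so the covariance is a quadratic polynomial in $\lambda$. Because it is nonzero at the vertex where the mixture equals the correlated component, it is not identically zero and can vanish only at isolated weights.

```python
import numpy as np

def mixture_cov(lam, m_phi, m_y, m_phiy):
    """Covariance of phi(X) and Y under the mixture sum_i lam[i] * P[i].

    m_phi[i], m_y[i], m_phiy[i] hold E[phi], E[Y], E[phi * Y] under component i.
    Expectations are linear in the mixture weights, so the result is quadratic in lam.
    """
    lam = np.asarray(lam, dtype=float)
    return lam @ m_phiy - (lam @ m_phi) * (lam @ m_y)

# Hypothetical moments (assumed for illustration only):
# component 0 has covariance 0.5 - 0 * 0 = 0.5, component 1 has covariance -1 - 1 * (-1) = 0.
m_phi = np.array([0.0, 1.0])
m_y = np.array([0.0, -1.0])
m_phiy = np.array([0.5, -1.0])

# Sweep the weight t placed on component 1; the mixture is (1 - t) * P[0] + t * P[1].
ts = np.linspace(0.0, 1.0, 1001)
covs = np.array([mixture_cov([1.0 - t, t], m_phi, m_y, m_phiy) for t in ts])

# Here the covariance is t**2 - 1.5 * t + 0.5, which vanishes only at isolated weights.
num_cancelling = int(np.sum(np.abs(covs) < 1e-9))
print(f"grid points with cancelled covariance: {num_cancelling} of {len(ts)}")
```

In this toy setting the cancelling weights form a finite set, which is the one-dimensional analogue of the measure-zero statement above.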

Figures (5)

  • Figure 1: Classifier trained directly from $P^{[i]}$ versus transferred from the mixture $P^{[\mathrm{mix}]}$.
  • Figure 2: Four subdistributions and a combined mixture distribution are represented by red points (labeled $\color{red}+1$) and blue points (labeled $\color{blue}-1$). Each subdistribution includes three points; two points contain an equal number of examples, while the third point contains twice as many. The size of each point reflects the number of examples it represents. The final mixture distribution is a weighted average of these four component distributions (not drawn to scale).
  • Figure 4: Richer Representations via Concatenation. An ensemble model (ResNet50Cat4) is created by concatenating four ResNet models trained separately on ImageNet using different random seeds. It is evaluated against a baseline model (ResNet50W2), which is a single ResNet model of the same size, trained once on ImageNet.
  • Figure 5: Increased Returns in Accuracy as Scale Increases. (Left) SSL, 100M to 1B params: The concatenated methods (in red and purple) outperform the baseline (dotted blue curve) on both SWAV trained on unlabeled ImageNet1K (top) and SEER on INSTAGRAM1B (bottom). (Right) ViT, 100M to 400M params: Concatenated representations (purple) consistently outperform the baseline (red) during transfer in both original (top) and modified (bottom) vision transformers (ViT).
  • Figure 6: Ensembling Boosts ResNet50 Transfer Accuracy Without Extra Compute. Baseline: A single model (ResNet50Cat1) is trained for 400k iters on ImageNet1k (plus $50$k iters for a stronger argument). Ensemble: A ResNet50Cat4 model contains four ResNet50 models trained separately (with different random seeds) on ImageNet for 100k iters each. The ensemble significantly outperforms the baseline during transfer.
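
The concatenation idea in the captions above can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not the paper's experimental setup: the four "models" are hypothetical frozen extractors that each keep a different pair of input coordinates (a deliberately constructed stand-in for sparse, complementary features), the downstream label depends on all coordinates, and the linear probe is a ridge-regression fit. The names `concat_features`, `fit_linear_probe`, and `probe_accuracy` are invented for this sketch.

```python
import numpy as np

def concat_features(extractors, x):
    """Concatenate the outputs of several frozen extractors along the feature axis."""
    return np.concatenate([f(x) for f in extractors], axis=1)

def add_bias(feats):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([feats, np.ones((feats.shape[0], 1))])

def fit_linear_probe(feats, y, ridge=1e-3):
    """Fit a ridge-regression probe on frozen features; returns the weight vector."""
    a = add_bias(feats)
    return np.linalg.solve(a.T @ a + ridge * np.eye(a.shape[1]), a.T @ y)

def probe_accuracy(w, feats, y):
    """Fraction of examples whose predicted sign matches the +1/-1 label."""
    return float(np.mean(np.sign(add_bias(feats) @ w) == y))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 8))
y = np.sign(x.sum(axis=1))  # toy downstream task that needs all 8 coordinates
x_train, y_train, x_test, y_test = x[:1000], y[:1000], x[1000:], y[1000:]

# Hypothetical stand-ins for four separately trained models:
# extractor i exposes only coordinates 2i and 2i + 1.
extractors = [lambda v, i=i: v[:, 2 * i:2 * i + 2] for i in range(4)]

# Probe on a single extractor versus the concatenation of all four.
w_single = fit_linear_probe(extractors[0](x_train), y_train)
acc_single = probe_accuracy(w_single, extractors[0](x_test), y_test)

w_cat = fit_linear_probe(concat_features(extractors, x_train), y_train)
acc_cat = probe_accuracy(w_cat, concat_features(extractors, x_test), y_test)

print(f"single extractor: {acc_single:.3f}, concatenated: {acc_cat:.3f}")
```

The gap here is built in by construction, since the extractors are complementary by design. Whether separately trained networks learn complementary features in practice is the empirical question the figures address.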

Theorems & Definitions (2)

  • Proposition 2.1
  • proof