These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining
Xingyu Alice Yang, Jianyu Zhang, Léon Bottou
TL;DR
The paper investigates why pretraining on a broad mixture of tasks does not guarantee parity with task-specific training, identifying a fundamental information-saturation bottleneck caused by sparsity bias in deep networks. It analyzes transfer via linear probing and neural tangent kernel features, provides a theoretical counterexample, and surveys empirical evidence of feature loss across spurious and domain-specific settings. To mitigate the bottleneck, the authors propose time-concatenation ensembles that create richer representations within a fixed compute budget, achieving measurable gains in transfer to unseen distributions while maintaining similar performance on in-distribution data. The results suggest that richer, more diverse representations—via ensembles or targeted pretraining—can substantially boost transfer learning, challenging the notion that ever-larger mono-model foundations alone deliver robust generalization.
Abstract
Transfer learning is widely used to adapt large pretrained models to new tasks with only a small amount of new data. However, a challenge persists: the features learned on the original task often do not fully cover what is needed for unseen data, especially when the relatedness of the tasks is unclear. Because deep learning models tend to learn very sparse representations, they retain only the minimal features required for the initial training and discard features that are potentially useful for downstream transfer. A theoretical framework developed in this work demonstrates that such pretraining captures inconsistent aspects of the data distribution, thereby inducing transfer bias. To address this limitation, we propose an inexpensive ensembling strategy that aggregates multiple models to generate richer feature representations. On ResNet, this approach yields a $9\%$ improvement in transfer accuracy without incurring extra pretraining cost. We also present empirical evidence from a range of deep learning studies, confirming that the phenomenon is pervasive across modern deep learning architectures. These results suggest that relying solely on large pretrained networks is not always the most effective way to improve model generalization. Instead, fostering richer, more diverse representations, for example through model ensembles, can substantially enhance transfer learning performance.
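As a toy illustration of the idea summarized above, the following sketch shows how concatenating the features of two models can recover information that a single sparse representation has discarded. It is not the authors' implementation: the synthetic data, the two hand-written "extractors" (`features_a`, `features_b`), and the least-squares linear probe are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical downstream task: the label depends on inputs 0 and 3.
    x = rng.standard_normal((n, 4))
    y = np.where(x[:, 0] + x[:, 3] > 0, 1.0, -1.0)
    return x, y

def features_a(x):
    # Stand-in for a pretrained model that kept only inputs 0 and 1.
    return x[:, :2]

def features_b(x):
    # Stand-in for a second model that kept only inputs 2 and 3.
    return x[:, 2:]

def with_bias(feats):
    # Append a constant column so the probe can learn an offset.
    return np.hstack([feats, np.ones((len(feats), 1))])

def fit_linear_probe(feats, y):
    # Linear probe fitted by least squares on +/-1 targets (frozen features).
    w, *_ = np.linalg.lstsq(with_bias(feats), y, rcond=None)
    return w

def probe_accuracy(w, feats, y):
    pred = np.where(with_bias(feats) @ w > 0, 1.0, -1.0)
    return float(np.mean(pred == y))

x_train, y_train = make_data(2000)
x_test, y_test = make_data(2000)

# Probe on a single model's features: input 3 is missing.
w_single = fit_linear_probe(features_a(x_train), y_train)
acc_single = probe_accuracy(w_single, features_a(x_test), y_test)

# Probe on the concatenated features of both models.
concat_train = np.hstack([features_a(x_train), features_b(x_train)])
concat_test = np.hstack([features_a(x_test), features_b(x_test)])
w_concat = fit_linear_probe(concat_train, y_train)
acc_concat = probe_accuracy(w_concat, concat_test, y_test)

print(f"single-model probe accuracy: {acc_single:.3f}")
print(f"concatenated probe accuracy: {acc_concat:.3f}")
```

In this sketch the single-model probe cannot see one of the two inputs that determine the label, so its accuracy is capped well below that of the probe trained on the concatenated representation. Real networks discard features far less cleanly than these hand-written extractors, so the gap here should be read only as a qualitative picture of the bottleneck.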
