Provable Target Sample Complexity Improvements as Pre-Trained Models Scale
Kazuto Fukuchi, Ryuichiro Hataya, Kota Matsui
TL;DR
The paper proposes caulking, a theoretical framework that explains why larger pre-trained models reduce downstream sample complexity. By formalizing a pre-trained model as a head and a feature extractor joined by an adapter (the caulk), the authors show that if the adapter class becomes simpler as the source size grows, one can achieve faster downstream error rates, up to an $n^{-2\alpha\beta/(2\alpha\beta+1)}$ scaling with an $o(1)$ term depending on the source size $m$. The key contributions are (i) a caulkability condition linking the ideal regressor to a pre-trained structure, (ii) an error bound that ties adapter complexity, Hölder smoothness, and sample sizes to improved rates, (iii) concrete examples in compositional spaces illustrating the potential rate improvements, and (iv) empirical evidence from CNN fine-tuning and vision-language models that aligns with the theory. The work provides a principled explanation for the scaling laws observed with pre-trained models and highlights a practical direction for designing PEFT adapters that enable scalable adaptation. How to construct pre-trained models that inherently exhibit caulkability through training strategies remains an open question.
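To give a feel for the rate quoted above, the following minimal Python sketch evaluates the exponent $2\alpha\beta/(2\alpha\beta+1)$ and the corresponding leading term $n^{-2\alpha\beta/(2\alpha\beta+1)}$ for a few values. It is only an illustration under stated assumptions: $\alpha$ and $\beta$ are treated as generic positive parameters (their precise roles are defined in the paper), constants are set to 1, and the $o(1)$ term in $m$ is ignored. The function names `rate_exponent` and `leading_rate` are hypothetical and do not come from the paper.

```python
def rate_exponent(alpha: float, beta: float) -> float:
    """Return r = 2*alpha*beta / (2*alpha*beta + 1), the exponent in n^{-r}.

    Assumption: alpha and beta are positive parameters; this sketch does not
    interpret them beyond the formula quoted in the summary.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    s = 2.0 * alpha * beta
    return s / (s + 1.0)


def leading_rate(n: int, alpha: float, beta: float) -> float:
    """Return n^{-r} with constants set to 1 and the o(1) term in m dropped."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return float(n) ** (-rate_exponent(alpha, beta))


# Illustrative values only; they are not taken from the paper.
for alpha, beta in [(0.5, 1.0), (1.0, 1.0), (2.0, 1.0)]:
    r = rate_exponent(alpha, beta)
    print(f"alpha={alpha}, beta={beta}: exponent={r:.3f}, "
          f"rate at n=1000 is {leading_rate(1000, alpha, beta):.5f}")
```

Because the exponent increases with the product $\alpha\beta$ and always stays below 1, a larger product corresponds to a faster decay of the leading term in the downstream sample size $n$.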
Abstract
Pre-trained models have become indispensable for efficiently building models across a broad spectrum of downstream tasks. The advantages of pre-trained models have been highlighted by empirical studies on scaling laws, which demonstrate that larger pre-trained models can significantly reduce the sample complexity of downstream learning. However, existing theoretical investigations of pre-trained models lack the capability to explain this phenomenon. In this paper, we provide a theoretical investigation by introducing a novel framework, caulking, inspired by parameter-efficient fine-tuning (PEFT) methods such as adapter-based fine-tuning, low-rank adaptation, and partial fine-tuning. Our analysis establishes that improved pre-trained models provably decrease the sample complexity of downstream tasks, thereby offering theoretical justification for the empirically observed scaling laws relating pre-trained model size to downstream performance, a relationship not covered by existing results.
