Provable Target Sample Complexity Improvements as Pre-Trained Models Scale

Kazuto Fukuchi, Ryuichiro Hataya, Kota Matsui

TL;DR

The paper proposes caulking, a theoretical framework that explains why larger pre-trained models reduce downstream sample complexity. The framework formalizes a pre-trained model as a head and a feature extractor joined by an adapter (the caulk). The authors show that if the adapter class becomes simpler as the source size $m$ grows, downstream error rates improve, up to an $n^{-2\alpha\beta/(2\alpha\beta+1)}$ scaling in the target sample size $n$ with an $o(1)$ term depending on $m$. The key contributions are (i) a caulkability condition linking the ideal regressor to a pre-trained structure, (ii) an error bound that ties adapter complexity, Hölder smoothness, and sample sizes to improved rates, (iii) concrete examples in compositional spaces illustrating the potential rate improvements, and (iv) empirical evidence from CNN fine-tuning and vision-language models that aligns with the theory. The work provides a principled explanation for data-scale laws observed with pre-trained models and highlights a practical direction for designing PEFT adapters that enable scalable adaptation. How to construct pre-trained models that inherently exhibit caulkability through training strategies remains an open question.
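
The head/adapter/extractor structure described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the composition order (head applied to adapter applied to extractor), the specific functions `g_e` and `g_h`, the one-parameter scaling adapter, and the grid search used to fit it. The point is only that the pre-trained pieces stay frozen while a small adapter class is searched on target data.

```python
from typing import Callable, List, Sequence


def g_e(x: float) -> List[float]:
    """Frozen feature extractor (hypothetical): x -> (x, x^2)."""
    return [x, x * x]


def g_h(z: Sequence[float]) -> float:
    """Frozen head (hypothetical): sum of the adapted features."""
    return sum(z)


def caulked(scale: float) -> Callable[[float], float]:
    """Assumed composition g_h(adapter(g_e(x))) with a one-parameter adapter."""
    def f(x: float) -> float:
        adapted = [scale * v for v in g_e(x)]  # the only trainable part
        return g_h(adapted)
    return f


def f_star(x: float) -> float:
    """Toy target regressor, chosen so that scale = 3 recovers it."""
    return 3.0 * (x + x * x)


def fit_scale(xs: Sequence[float], ys: Sequence[float],
              candidates: Sequence[float]) -> float:
    """Pick the adapter parameter with the smallest mean squared error."""
    def mse(scale: float) -> float:
        f = caulked(scale)
        return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return min(candidates, key=mse)


# Noise-free target samples on a grid, and a small adapter class to search.
xs = [i / 10 for i in range(-10, 11)]
ys = [f_star(x) for x in xs]
candidates = [k / 2 for k in range(0, 13)]  # 0.0, 0.5, ..., 6.0
best_scale = fit_scale(xs, ys, candidates)
print("selected adapter scale:", best_scale)
```

In this toy setting the adapter class has a single parameter, so very few target samples are needed to select it; a richer adapter class would need more data, which is the intuition the framework makes precise.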

Abstract

Pre-trained models have become indispensable for efficiently building models across a broad spectrum of downstream tasks. The advantages of pre-trained models have been highlighted by empirical studies on scaling laws, which demonstrate that larger pre-trained models can significantly reduce the sample complexity of downstream learning. However, existing theoretical investigations of pre-trained models lack the capability to explain this phenomenon. In this paper, we provide a theoretical investigation by introducing a novel framework, caulking, inspired by parameter-efficient fine-tuning (PEFT) methods such as adapter-based fine-tuning, low-rank adaptation, and partial fine-tuning. Our analysis establishes that improved pre-trained models provably decrease the sample complexity of downstream tasks, thereby offering theoretical justification for the empirically observed scaling laws relating pre-trained model size to downstream performance, a relationship not covered by existing results.

Paper Structure

This paper contains 48 sections, 15 theorems, 90 equations, and 4 figures.

Key Result

Theorem 1

Let $\alpha \in (0,1]$ and $\beta > 0$. Suppose that $f^*$ is $(n^{-\frac{2\alpha\beta}{2\alpha\beta+1}}, \mathcal{G})$-caulkable by $f_{\mathrm{pre}}=(g_h, g_e)$ for some $\beta$-complex class $\mathcal{G}$, and $g_h$ is $\alpha$-Hölder continuous. Let $f_n$ denote the estimated regressor obtained for some constant $C > 0$.
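
To get a feel for the rate $n^{-\frac{2\alpha\beta}{2\alpha\beta+1}}$ that appears in the caulkability condition, the following sketch computes the exponent for given $\alpha$ and $\beta$ and the number of target samples at which $n^{-\text{exponent}}$ falls below a tolerance. It is a back-of-the-envelope calculation under stated assumptions: the constant $C$ and the $o(1)$ term are ignored, and the function names are hypothetical rather than the paper's.

```python
import math


def rate_exponent(alpha: float, beta: float) -> float:
    """Exponent 2*alpha*beta / (2*alpha*beta + 1) of the rate in n."""
    if not (0.0 < alpha <= 1.0):
        raise ValueError("alpha must lie in (0, 1]")
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    ab2 = 2.0 * alpha * beta
    return ab2 / (ab2 + 1.0)


def samples_for_error(eps: float, alpha: float, beta: float) -> int:
    """Smallest n (up to rounding) with n**(-exponent) <= eps.

    Assumption: constants and lower-order terms are dropped, so this is
    only an order-of-magnitude guide.
    """
    if not (0.0 < eps < 1.0):
        raise ValueError("eps must lie in (0, 1)")
    return math.ceil(eps ** (-1.0 / rate_exponent(alpha, beta)))


# Compare two settings at the same tolerance.
for alpha, beta in [(1.0, 0.5), (1.0, 2.0)]:
    print(alpha, beta,
          rate_exponent(alpha, beta),
          samples_for_error(0.01, alpha, beta))
```

The exponent increases with the product $\alpha\beta$ and approaches 1 as that product grows, so a larger $\alpha\beta$ means fewer target samples are needed for the same tolerance.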

Figures (4)

  • Figure 1: A conceptual illustration of caulking. Blue boxes represent pre-trained models, and red boxes represent underlying functions. The horizontal axis represents the source sample size $m$, which corresponds to the scale of the pre-trained model.
  • Figure 2: A motivating example illustrating the successful utilization of a pre-trained model.
  • Figure 3: The relationship between the depth of adapters and the error rate on the target domain. Minimum error rates for each model are marked by $\star$.
  • Figure 4: The relationship between the depth of adapters and the error rate on the MMStar dataset (Chen et al., 2024).

Theorems & Definitions (26)

  • Definition 1: Caulkability
  • Remark 1
  • Definition 2: $\beta$-complexity
  • Definition 3: Hölder continuity
  • Theorem 1
  • Corollary 1
  • Theorem 2 (Schmidt-Hieber, 2020)
  • Theorem 3
  • Theorem 4 (Schmidt-Hieber, 2020; Hayakawa and Suzuki, 2020)
  • Proposition 1
  • ...and 16 more