Provable Target Sample Complexity Improvements as Pre-Trained Models Scale

Kazuto Fukuchi; Ryuichiro Hataya; Kota Matsui

TL;DR

The paper proposes caulking, a theoretical framework that explains why larger pre-trained models reduce downstream sample complexity. It formalizes a pre-trained model as a head and a feature extractor with an adapter (the caulk), and shows that if the adapter class becomes simpler as the source size grows, the downstream error can decay faster, up to an $n^{-2\alpha\beta/(2\alpha\beta+1)}$ scaling, with an $o(1)$ term depending on the source size $m$. The key contributions are (i) a caulkability condition linking the ideal regressor to a pre-trained structure, (ii) an error bound that ties adapter complexity, Hölder smoothness, and sample sizes to improved rates, (iii) concrete examples in compositional spaces illustrating the potential rate improvements, and (iv) empirical evidence from CNN fine-tuning and vision-language models that aligns with the theory. The work offers a principled explanation for the scaling laws observed with pre-trained models and points to a practical direction for designing PEFT adapters that enable scalable adaptation. How to construct pre-trained models that inherently exhibit caulkability through their training strategy remains an open question.

Abstract

Pre-trained models have become indispensable for efficiently building models across a broad spectrum of downstream tasks. The advantages of pre-trained models have been highlighted by empirical studies on scaling laws, which demonstrate that larger pre-trained models can significantly reduce the sample complexity of downstream learning. However, existing theoretical investigations of pre-trained models lack the capability to explain this phenomenon. In this paper, we provide a theoretical investigation by introducing a novel framework, caulking, inspired by parameter-efficient fine-tuning (PEFT) methods such as adapter-based fine-tuning, low-rank adaptation, and partial fine-tuning. Our analysis establishes that improved pre-trained models provably decrease the sample complexity of downstream tasks, thereby offering theoretical justification for the empirically observed scaling laws relating pre-trained model size to downstream performance, a relationship not covered by existing results.

Paper Structure (48 sections, 15 theorems, 90 equations, 4 figures)

Introduction
Contributions.
Related Work
Domain Adaptation Theory.
Deep Learning Theory for Pre-trained Models.
Large Language Models.
Learning with Pre-trained Model via Caulking
Notations.
Problem Setup.
Motivating Example.
Caulking.
Empirical Caulking.
Error Analysis of Empirical Caulking
Assumption.
Complexity of $\mathcal{G}$.
...and 33 more sections

Key Result

Theorem 1

Let $\alpha \in (0,1]$ and $\beta > 0$. Suppose that $f^*$ is $(n^{-\frac{2\alpha\beta}{2\alpha\beta+1}}, \mathcal{G})$-caulkable by $f_{\mathrm{pre}}=(g_h, g_e)$ for some $\beta$-complex class $\mathcal{G}$, and $g_h$ is $\alpha$-Hölder continuous. Let $f_n$ denote the estimated regressor obtained for some constant $C > 0$.
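To get a feel for the exponent appearing in the caulkability tolerance above, the following minimal sketch evaluates $r = 2\alpha\beta/(2\alpha\beta+1)$ and the sample size at which $n^{-r}$ falls below a target level. The function names `rate_exponent` and `samples_for_level` are hypothetical and not taken from the paper, and treating $n^{-r}$ as a stand-in for the error (dropping constants and the $o(1)$ term in $m$) is an assumption made purely for illustration.

```python
import math


def rate_exponent(alpha: float, beta: float) -> float:
    """Return r = 2*alpha*beta / (2*alpha*beta + 1), the exponent in n**(-r)."""
    if not (0.0 < alpha <= 1.0):
        raise ValueError("alpha must lie in (0, 1]")
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    s = 2.0 * alpha * beta
    return s / (s + 1.0)


def samples_for_level(eps: float, alpha: float, beta: float) -> int:
    """Smallest integer n with n**(-r) <= eps.

    Assumption: constants and the o(1) term in the source size m are ignored,
    so this is an order-of-magnitude illustration, not a bound from the paper.
    """
    if not (0.0 < eps < 1.0):
        raise ValueError("eps must lie in (0, 1)")
    r = rate_exponent(alpha, beta)
    return math.ceil(eps ** (-1.0 / r))


if __name__ == "__main__":
    # Compare the exponent and the implied sample size for a few beta values.
    for beta in (0.5, 1.0, 4.0):
        r = rate_exponent(1.0, beta)
        n = samples_for_level(0.1, 1.0, beta)
        print(f"alpha=1.0 beta={beta}: r={r:.3f}, n={n}")
```

As the product $\alpha\beta$ grows, the exponent approaches 1, so under this simplified reading fewer target samples are needed to reach the same level.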

Figures (4)

Figure 1: A conceptual illustration of caulking. Blue boxes represent pre-trained models, and red boxes represent underlying functions. The horizontal axis represents the source sample size $m$, which corresponds to the scale of the pre-trained model.
Figure 2: A motivating example illustrating the successful utilization of a pre-trained model.
Figure 3: The relationship between the depth of adapters and the error rate on the target domain. Minimum error rates for each model are marked by $\star$.
Figure 4: The relationship between the depth of adapters and the error rate on the MMStar dataset (chen2024are).

Theorems & Definitions (26)

Definition 1: Caulkability
Remark 1
Definition 2: $\beta$-complexity
Definition 3: Hölder continuity
Theorem 1
Corollary 1
Theorem 2 (schmidt-hieberNonparametricRegressionUsing2020)
Theorem 3
Theorem 4 (schmidt-hieberNonparametricRegressionUsing2020; hayakawaMinimaxOptimalitySuperiority2020)
Proposition 1
...and 16 more
