Towards Understanding Feature Learning in Parameter Transfer

Hua Yuan, Xuran Meng, Qiufeng Wang, Shiyu Xia, Ning Xu, Xu Yang, Jing Wang, Xin Geng, Yong Rui

TL;DR

This theory is the first to provide a dynamic analysis for parameter transfer and also the first to prove the existence of negative transfer theoretically.

Abstract

Parameter transfer is a central paradigm in transfer learning, enabling knowledge reuse across tasks and domains by sharing model parameters between upstream and downstream models. However, when only a subset of parameters from the upstream model is transferred to the downstream model, there remains a lack of theoretical understanding of the conditions under which such partial parameter reuse is beneficial and of the factors that govern its effectiveness. To address this gap, we analyze a setting in which both the upstream and downstream models are ReLU convolutional neural networks (CNNs). Within this theoretical framework, we characterize how the inherited parameters act as carriers of universal knowledge and identify key factors that amplify their beneficial impact on the target task. Furthermore, our analysis provides insight into why, in certain cases, transferring parameters can lead to lower test accuracy on the target task than training a new model from scratch. To the best of our knowledge, our theory is the first to provide a dynamic analysis of parameter transfer and also the first to prove theoretically that negative transfer exists. Numerical experiments and experiments on real-world data empirically validate our theoretical findings.

Paper Structure

This paper contains 19 sections, 31 theorems, 166 equations, 4 figures, 2 tables, 2 algorithms.

Key Result

Theorem 4.2

Suppose that a fraction $\alpha$ ($0<\alpha \leq 1$) of the upstream model's weights is inherited. For any $\varepsilon, \delta > 0$, if Condition 4.1 holds, then there exist constants $C_1, C_2, C_3 > 0$ such that, with probability at least $1 - 2\delta$, the following results hold at $T
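
As a rough illustration of what inheriting a fraction $\alpha$ of the upstream weights could look like in practice, the sketch below copies the first $\lceil \alpha m \rceil$ of $m$ upstream filters into the downstream model and re-initializes the rest with small Gaussian noise. The function name `inherit_filters`, the choice of which filters to copy (the first ones), and the initialization scale `init_std` are assumptions made for this example; the excerpt does not specify the paper's exact inheritance scheme.

```python
import math
import random


def inherit_filters(upstream, alpha, init_std=0.01, seed=0):
    """Build downstream filters from upstream ones (hypothetical helper).

    The first ceil(alpha * m) filters are copied from `upstream`; the
    remaining ones are drawn fresh from N(0, init_std^2). Which filters
    get copied is an assumption of this sketch, not the paper's rule.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    rng = random.Random(seed)
    m = len(upstream)
    dim = len(upstream[0])
    k = math.ceil(alpha * m)  # number of inherited filters
    inherited = [list(w) for w in upstream[:k]]
    fresh = [[rng.gauss(0.0, init_std) for _ in range(dim)]
             for _ in range(m - k)]
    return inherited + fresh


# Toy upstream model with four 2-dimensional filters (made-up values).
upstream_filters = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
downstream_filters = inherit_filters(upstream_filters, alpha=0.5)
```

With `alpha=1` every filter is copied, which corresponds to full parameter transfer; smaller values leave more of the downstream model to be learned from the target task alone.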

Figures (4)

  • Figure 1: Test accuracy under varying conditions of the source task. "w/o PT" corresponds to standard training without parameter transfer. We compare three key factors that influence the effectiveness of parameter transfer: (a) training sample size of Task 1 $N_1$; (b) the noise level of Task 1; (c) the universal signal strength $\|\mathbf{u}\|_2$ while fixing $\|\mathbf{u}+\mathbf{v}_2\|_2$. All scenarios include a baseline setting without parameter transfer.
  • Figure 2: (a) Heatmap of test accuracy under different dimensions $d$ and universal signal strengths $\|\mathbf{u}\|_2$ with fixed $\|\mathbf{u}+\mathbf{v}_2\|_2$. The x-axis is the value of $\|\mathbf{u}\|_2$ and the y-axis is the dimension $d$. (b) and (c) display the truncated heatmaps of test accuracy. Accuracy below 0.65 (0.70) is set to 0 (yellow) and all other values are set to 1 (blue).
  • Figure 3: Effect of varying $\sigma_{p,2}$ on CIFAR-10 and CIFAR-100. Test accuracy of ResNet-34 and ResNet-50 as downstream models on (a) CIFAR-10 and (b) CIFAR-100 under different noise levels $\sigma_{p,2}$. "w/" and "w/o" denote models trained with and without parameter transfer, respectively.
  • Figure 4: We adopt ViT models as the upstream and downstream models. The upstream model is pre-trained on ImageNet-1K and the downstream models are trained on CIFAR-10 and CIFAR-100, separately.
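
The truncation described in the Figure 2 caption amounts to thresholding an accuracy grid into a binary map. A minimal sketch follows, assuming the grid is stored as a list of rows; the helper name `truncate_heatmap` and the accuracy values are made up for illustration and are not results from the paper.

```python
def truncate_heatmap(acc_grid, threshold):
    """Map each accuracy to 0 if it is below `threshold`, else to 1."""
    return [[1 if acc >= threshold else 0 for acc in row]
            for row in acc_grid]


# Hypothetical accuracies: rows index the dimension d,
# columns index the universal signal strength ||u||_2.
example_grid = [
    [0.60, 0.66],
    [0.70, 0.72],
]
mask_065 = truncate_heatmap(example_grid, 0.65)  # threshold 0.65
mask_070 = truncate_heatmap(example_grid, 0.70)  # threshold 0.70
```

Binarizing at a fixed level makes the boundary between the low-accuracy and high-accuracy regions easier to read than a continuous color scale.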

Theorems & Definitions (34)

  • Definition 3.1: Data in Task 1
  • Definition 3.2: Data in Task 2
  • Theorem 4.2: With parameter transfer
  • Theorem 4.3: Without parameter transfer (previous results in Kou et al., 2023)
  • Proposition 4.4
  • Lemma A.1
  • Lemma A.2
  • Definition B.1
  • Lemma B.2: Update Rule
  • Lemma C.1
  • ...and 24 more