Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning

Runxin Xu, Fuli Luo, Zhiyuan Zhang, Chuanqi Tan, Baobao Chang, Songfang Huang, Fei Huang

TL;DR

Child-Tuning tackles overfitting and poor generalization when fine-tuning very large pretrained language models on limited data. It updates only a subset of parameters (the child network) via a gradient-mask mechanism, while the full network participates in the forward pass. The authors present two variants: a task-free version, CT$_F$, that relies on stochastic gradient masking, and a task-driven version, CT$_D$, that selects parameters using diagonal Fisher information to maximize task relevance. Across GLUE tasks and domain-transfer scenarios, Child-Tuning yields substantial improvements over vanilla fine-tuning and outperforms or complements prior regularization and parameter-efficient methods, indicating strong generalization. The approach is simple, model-agnostic, and orthogonal to existing fine-tuning strategies, which makes it practical to adopt when deploying large PLMs.

Key mathematical ideas include the gradient-masking update rule $\boldsymbol{w}_{t+1}=\boldsymbol{w}_t-\eta\,\nabla\mathcal{L}(\boldsymbol{w}_t)\odot\boldsymbol{M}_t$, where $\eta$ is the learning rate and $\boldsymbol{M}_t$ is a 0/1 mask applied elementwise to the gradient, and the Fisher-information-based selection $F^{(i)}(\boldsymbol{w})=\frac{1}{|D|}\sum_{j=1}^{|D|}\left(\frac{\partial \log p(y_j\mid x_j;\boldsymbol{w})}{\partial \boldsymbol{w}^{(i)}}\right)^2$, which is used to determine the task-driven child network.
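The gradient-masking update described in the TL;DR can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy quadratic loss, the helper names (`sample_mask`, `masked_sgd_step`, `quadratic_grad`), and the values of `lr` and `keep_prob` are illustrative assumptions. The gradient is computed for every parameter, and the mask then zeroes the entries outside the child network before the update.

```python
import random


def sample_mask(n, keep_prob, rng):
    """Draw a 0/1 mask; each parameter joins the child network with probability keep_prob."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n)]


def masked_sgd_step(weights, grads, mask, lr):
    """One update w <- w - lr * (grad * mask); masked-out parameters stay unchanged."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


def quadratic_grad(weights):
    """Gradient of the toy loss L(w) = 0.5 * sum(w_i^2) (assumption, for illustration only)."""
    return list(weights)


rng = random.Random(0)
weights = [1.0, -2.0, 3.0, -4.0]
grads = quadratic_grad(weights)  # gradient is computed for the full network
mask = sample_mask(len(weights), keep_prob=0.5, rng=rng)  # task-free: random mask
new_weights = masked_sgd_step(weights, grads, mask, lr=0.1)
```

Drawing a fresh mask at every step gives the stochastic, task-free behaviour; reusing one fixed mask for all steps would correspond to a child network chosen once in advance.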

Abstract

Recent pretrained language models extend from millions to billions of parameters. Thus the need to fine-tune an extremely large pretrained model with a limited training corpus arises in various downstream tasks. In this paper, we propose a straightforward yet effective fine-tuning technique, Child-Tuning, which updates a subset of parameters (called the child network) of large pretrained models via strategically masking out the gradients of the non-child network during the backward process. Experiments on various downstream tasks in the GLUE benchmark show that Child-Tuning consistently outperforms vanilla fine-tuning by 1.5 to 8.6 points in average score across four different pretrained models, and surpasses prior fine-tuning techniques by 0.6 to 1.3 points. Furthermore, empirical results on domain transfer and task transfer show that Child-Tuning can obtain better generalization performance by large margins.

Paper Structure

This paper contains 37 sections, 5 theorems, 30 equations, 3 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Suppose $\mathcal{L}$ denotes the loss function on the parameter $\mathbf{w}$, the gradients obey a Gaussian distribution $\mathcal{N}(\frac{\partial \mathcal{L}}{\partial \mathbf{w}},\sigma^2_\mathbf{g}\mathbf{I}_k)$, and SGD with learning rate $\eta$ is used. For a randomly sampled batch [...] In particular, when $\mathbf{w}$ is a local minimum, $\mathbb{E}[\mathbf{\Delta w}]=\mathbf{0}_k$, $\Sigma$ [...]
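The part of the theorem that survives here, a zero expected update at a local minimum, can be checked numerically with a small simulation. This is a sketch under stated assumptions rather than a reproduction of the paper's setting: each step uses a single noisy gradient draw with mean zero, the mask is an independent random 0/1 draw per coordinate, and the function name and all numeric values are hypothetical.

```python
import random


def simulate_mean_update(dim, sigma, lr, keep_prob, trials, seed=0):
    """Average the masked SGD update over many noisy gradient draws at a local minimum."""
    rng = random.Random(seed)
    totals = [0.0] * dim
    for _ in range(trials):
        for i in range(dim):
            grad = rng.gauss(0.0, sigma)  # mean gradient is zero at a local minimum
            keep = 1.0 if rng.random() < keep_prob else 0.0  # random 0/1 mask entry
            totals[i] += -lr * grad * keep  # masked update for coordinate i
    return [t / trials for t in totals]


mean_update = simulate_mean_update(dim=3, sigma=1.0, lr=0.1, keep_prob=0.3, trials=20000)
```

With these settings the empirical mean of each coordinate's update should sit very close to zero, in line with $\mathbb{E}[\mathbf{\Delta w}]=\mathbf{0}_k$.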

Figures (3)

  • Figure 1: The illustration of Child-Tuning. Left: The forward pass runs through the whole network, while the backward pass updates only a subset of it (i.e., the child network). Right: To achieve this, a task-free or task-driven mask is applied to the gradients of the non-child network, resetting them to zero (grey diagonal grids).
  • Figure 2: Probing task generalization. The model is fine-tuned on the MRPC task and transferred to four different tasks. Child-Tuning can maintain more generalizable representations compared with vanilla fine-tuning.
  • Figure 3: The overlapping ratio between task-driven child networks across GLUE tasks.
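The task-driven mask mentioned in the Figure 1 caption is attributed in the TL;DR to diagonal Fisher information. A minimal sketch of how such a mask might be built is given below. The helper names, the hypothetical per-example gradients, and the rule of keeping the highest-scoring fraction of parameters are assumptions of this sketch, not details confirmed by this page.

```python
def diagonal_fisher(per_example_grads):
    """Average squared per-example gradient of the log-likelihood, per parameter."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def top_fraction_mask(scores, fraction):
    """Return a 0/1 mask that keeps the highest-scoring `fraction` of parameters."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


# Hypothetical gradients of log p(y|x; w) for three examples and four parameters.
per_example_grads = [
    [0.1, -2.0, 0.0, 0.5],
    [0.2, 1.5, 0.1, -0.4],
    [-0.1, -1.0, 0.0, 0.6],
]
fisher = diagonal_fisher(per_example_grads)
child_mask = top_fraction_mask(fisher, fraction=0.5)
```

Here the second and fourth parameters have the largest Fisher scores, so they form the child network; the resulting mask would stay fixed during fine-tuning, in contrast to a task-free mask that is redrawn at random.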

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2
  • Theorem 1
  • Theorem 2
  • proof
  • proof
  • Lemma 1