Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning
Runxin Xu, Fuli Luo, Zhiyuan Zhang, Chuanqi Tan, Baobao Chang, Songfang Huang, Fei Huang
TL;DR
Child-Tuning tackles overfitting and poor generalization when fine-tuning very large pretrained language models (PLMs) on limited data. It updates only a subset of parameters, the child network, by masking gradients during the backward pass, while the full network still participates in the forward pass. The authors present two variants: a task-free version, CT$_F$, which relies on stochastic gradient masking, and a task-driven version, CT$_D$, which selects parameters using diagonal Fisher information to favor those most relevant to the task. On GLUE tasks, Child-Tuning improves over vanilla fine-tuning by 1.5 to 8.6 points in average score across four pretrained models and surpasses prior fine-tuning techniques by 0.6 to 1.3 points; it also generalizes better in domain-transfer and task-transfer settings. The approach is simple, model-agnostic, and orthogonal to existing fine-tuning strategies, which makes it practical to combine with them when deploying large PLMs.

Key mathematical ideas include the gradient-masking update rule $\boldsymbol{w}_{t+1}=\boldsymbol{w}_t-\eta\,\nabla L(\boldsymbol{w}_t)\odot\boldsymbol{M}_t$, where $\eta$ is the learning rate, $\odot$ denotes element-wise multiplication, and $\boldsymbol{M}_t$ is the 0/1 mask marking the child network, and the Fisher-information-based selection $F^{(i)}(\boldsymbol{w})=\frac{1}{|D|}\sum_{j=1}^{|D|}\left(\frac{\partial \log p(y_j\mid x_j;\boldsymbol{w})}{\partial \boldsymbol{w}^{(i)}}\right)^2$, computed over the task data $D$ and used to determine the task-driven child network.
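The task-driven selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy logistic-regression model in place of a PLM so that the per-example gradient of $\log p(y|x;\boldsymbol{w})$ has a closed form, and the names `diagonal_fisher`, `child_mask_from_fisher`, and `keep_ratio` are hypothetical. The mask is computed once here, which is also an assumption of the sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def diagonal_fisher(w, data):
    """Average squared gradient of log p(y|x; w) for each parameter.

    Assumption: a toy logistic-regression model, for which
    d/dw_i log p(y|x; w) = (y - sigmoid(w . x)) * x_i.
    """
    fisher = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            g = (y - p) * xi
            fisher[i] += g * g
    return [f / len(data) for f in fisher]


def child_mask_from_fisher(fisher, keep_ratio):
    """Return a 0/1 mask keeping the top `keep_ratio` fraction of parameters."""
    k = max(1, round(keep_ratio * len(fisher)))
    top = sorted(range(len(fisher)), key=lambda i: fisher[i], reverse=True)[:k]
    mask = [0] * len(fisher)
    for i in top:
        mask[i] = 1
    return mask


# Toy data: the third feature is always zero, so its Fisher value is zero
# and it is left out of the child network.
data = [([1.0, 2.0, 0.0], 1), ([2.0, -1.0, 0.0], 0), ([0.5, 3.0, 0.0], 1)]
w = [0.0, 0.0, 0.0]
fisher = diagonal_fisher(w, data)
mask = child_mask_from_fisher(fisher, keep_ratio=2 / 3)
```

In this toy setting the two informative parameters receive mask value 1 and the uninformative one receives 0, mirroring the idea that parameters with higher Fisher information are treated as more relevant to the task.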
Abstract
Recent pretrained language models extend from millions to billions of parameters. Thus the need to fine-tune an extremely large pretrained model with a limited training corpus arises in various downstream tasks. In this paper, we propose a straightforward yet effective fine-tuning technique, Child-Tuning, which updates a subset of parameters (called the child network) of large pretrained models by strategically masking out the gradients of the non-child network during the backward process. Experiments on various downstream tasks in the GLUE benchmark show that Child-Tuning consistently outperforms vanilla fine-tuning by 1.5~8.6 average score among four different pretrained models, and surpasses the prior fine-tuning techniques by 0.6~1.3 points. Furthermore, empirical results on domain transfer and task transfer show that Child-Tuning can obtain better generalization performance by large margins.
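The idea of masking out the gradients of the non-child network can be sketched as a single masked gradient step. This is an illustrative sketch over plain Python lists rather than a real PLM; the helper names `bernoulli_mask` and `masked_sgd_step`, the `keep_prob` parameter, and the learning rate value are assumptions made for the example, and any gradient rescaling an actual implementation might apply is omitted.

```python
import random


def bernoulli_mask(n, keep_prob, rng):
    """Task-free mask: each parameter joins the child network with prob. keep_prob."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n)]


def masked_sgd_step(w, grad, mask, lr):
    """Apply w <- w - lr * grad * mask element-wise.

    Parameters with mask 0 (the non-child network) are left unchanged,
    although they still contributed to the forward pass that produced grad.
    """
    return [wi - lr * gi * mi for wi, gi, mi in zip(w, grad, mask)]


w = [1.0, -2.0, 0.5, 3.0]
grad = [0.1, 0.2, -0.3, 0.4]  # stand-in for a gradient from a full backward pass
mask = [1, 0, 1, 0]           # fixed here for reproducibility
new_w = masked_sgd_step(w, grad, mask, lr=0.5)

# A stochastic mask would be redrawn at every step, for example:
step_mask = bernoulli_mask(len(w), keep_prob=0.3, rng=random.Random(0))
```

Only the first and third entries of `new_w` move; the second and fourth stay at their original values because their gradients were masked out.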
