Boosting Large Language Models with Mask Fine-Tuning
Mingyuan Zhang, Yue Bai, Huan Wang, Yizhou Wang, Qihua Dong, Yun Fu
TL;DR
The paper asks whether an LLM's structure must be kept fully intact during fine-tuning and introduces Mask Fine-Tuning (MFT), which freezes a well-tuned model and learns a binary mask that removes selected weights. The mask is trained with the standard fine-tuning objective and optimized through a straight-through estimator, and the resulting models exceed the performance upper bound of full fine-tuning (FFT) across multiple backbones and domains. Layer-group ablations, masking-ratio studies, and data-ratio studies show consistent improvements and establish MFT as a new, practical protocol to apply after fine-tuning. By extending mask learning beyond pruning, the work offers a general way to improve LLM performance within existing fine-tuning pipelines and highlights the potential of sparsity-driven augmentation in large-scale models.
Abstract
Mainstream large language model (LLM) fine-tuning protocols usually keep the model intact, and no prior work has questioned whether maintaining the model's integrity is indispensable for performance. In this work, we introduce Mask Fine-Tuning (MFT), a new LLM fine-tuning paradigm which shows that properly breaking the integrity of the model can, surprisingly, improve performance. Specifically, MFT learns a set of binary masks supervised by the typical LLM fine-tuning objective. Extensive experiments show that MFT yields a consistent performance boost across various domains and backbones (e.g., 1.95%/1.88% average gain in coding with LLaMA2-7B/3.1-8B). We also provide detailed studies of MFT from different hyperparameter perspectives for better insight. In particular, MFT naturally updates the current LLM training protocol, because it is deployed on a complete, well-trained model. This study extends mask learning from its conventional use in network pruning for model compression to a more general scope.
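To make the idea of learning a binary mask over frozen weights concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation, and the text above gives no implementation details. The three-weight linear model, the squared-error loss, the real-valued scores thresholded at zero, and all names (`binarize`, `train_mask`, `FROZEN_W`, `make_toy_data`) are assumptions chosen for illustration. The only elements taken from the summary are that the weights stay frozen, that only the mask is trained with the ordinary task loss, and that a straight-through estimator passes gradients through the binarization step.

```python
from itertools import product

# Hypothetical "well-tuned" weights that are frozen; the middle one is
# deliberately harmful for the toy task below.
FROZEN_W = [2.0, 1.5, -1.0]
TRUE_W = [2.0, 0.0, -1.0]  # weights that generate the toy targets


def make_toy_data():
    """All sign patterns in {-1, 1}^3, with targets from TRUE_W."""
    data = []
    for x in product([-1.0, 1.0], repeat=3):
        y = sum(t * xi for t, xi in zip(TRUE_W, x))
        data.append((list(x), y))
    return data


def binarize(scores):
    """Hard-threshold real-valued scores into a 0/1 mask (an assumed scheme)."""
    return [1.0 if s > 0.0 else 0.0 for s in scores]


def forward(weights, mask, x):
    """Linear model whose weights are gated elementwise by the mask."""
    return sum(w * m * xi for w, m, xi in zip(weights, mask, x))


def mse(weights, mask, data):
    """Mean squared error of the masked model on the data."""
    return sum((forward(weights, mask, x) - y) ** 2 for x, y in data) / len(data)


def train_mask(weights, data, lr=0.05, epochs=5, init_score=1.0):
    """Learn mask scores with SGD; the weights themselves are never updated."""
    scores = [init_score] * len(weights)  # start from the full, unmasked model
    for _ in range(epochs):
        for x, y in data:
            mask = binarize(scores)
            err = forward(weights, mask, x) - y
            for i in range(len(scores)):
                # Straight-through estimator: treat d(mask)/d(score) as 1, so
                # the score receives the gradient of the loss w.r.t. the mask.
                grad = 2.0 * err * weights[i] * x[i]
                scores[i] -= lr * grad
    return binarize(scores), scores


data = make_toy_data()
full_mask = [1.0] * len(FROZEN_W)
learned_mask, learned_scores = train_mask(FROZEN_W, data)
print("loss with all weights kept:", mse(FROZEN_W, full_mask, data))
print("learned mask:", learned_mask)
print("loss with learned mask:", mse(FROZEN_W, learned_mask, data))
```

In this toy setting the learned mask drops only the harmful middle weight, and the loss falls from 2.25 to 0.0 without any weight being updated. Real LLM layers would need per-tensor masks and an autodiff framework, and the abstract does not specify how MFT parameterizes or initializes its masks.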
