Balancing Speciality and Versatility: A Coarse to Fine Framework for Mitigating Catastrophic Forgetting in Large Language Models

Hengyuan Zhang; Yanru Wu; Dawei Li; Sak Yang; Rui Zhao; Yong Jiang; Fei Tan

TL;DR

This work tackles catastrophic forgetting (CF) in aligned LLMs: how to fine-tune for domain-specific speciality without sacrificing broad versatility. It introduces CoFiTune, a coarse-to-fine framework that first identifies a module (primarily the FFN) within a limited layer range to update, and then applies a Fine-SoftMask that regulates updates at the unit level according to each unit's importance for versatility. In a Chinese CF setting, with extensive experiments across multiple tasks and model scales, CoFiTune consistently outperforms full SFT and CF baselines in the overall evaluation of speciality and versatility, with notable improvements in Uni scores. The work also analyzes module importance and proposes a speculative view of information flow in LLMs to explain the observed benefits, offering practical guidance for stable, targeted fine-tuning of large transformers.

Abstract

Aligned Large Language Models (LLMs) showcase remarkable versatility and can handle diverse real-world tasks. Aligned LLMs are also expected to exhibit speciality, excelling in specific applications. However, fine-tuning with extra data, a common practice for gaining speciality, often leads to catastrophic forgetting (CF) of previously acquired versatility, hindering the model's performance across diverse tasks. In response to this challenge, we propose CoFiTune, a coarse-to-fine framework that attempts to strike a balance between speciality and versatility. At the coarse-grained level, an empirical tree-search algorithm is used to pinpoint and update specific modules that are crucial for speciality, while keeping other parameters frozen; at the fine-grained level, a soft-masking mechanism regulates the update to the LLMs, mitigating the CF issue without harming speciality. In an overall evaluation of both speciality and versatility, CoFiTune consistently outperforms baseline methods across diverse tasks and model scales. Compared to full-parameter SFT, CoFiTune yields about a 14% versatility improvement with marginal speciality loss on a 13B model. Lastly, based on further analysis, we provide a speculative insight into the information forwarding process in LLMs, which helps explain the effectiveness of the proposed method. The code is available at https://github.com/rattlesnakey/CoFiTune.
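The fine-grained soft-masking idea can be illustrated with a small sketch. The abstract does not give the exact update rule, so this example assumes that each unit's gradient is scaled by one minus a versatility importance score normalized to [0, 1]. The function names `soft_mask_gradients` and `sgd_step` are hypothetical and are not taken from the CoFiTune repository.

```python
from typing import List


def soft_mask_gradients(grads: List[float], importance: List[float]) -> List[float]:
    """Scale each unit's gradient by (1 - importance).

    Assumption: importance is a per-unit score in [0, 1], where 1 means the
    unit matters most for versatility and should barely move.
    """
    if len(grads) != len(importance):
        raise ValueError("grads and importance must have the same length")
    masked = []
    for grad, score in zip(grads, importance):
        if not 0.0 <= score <= 1.0:
            raise ValueError("importance scores must lie in [0, 1]")
        masked.append((1.0 - score) * grad)
    return masked


def sgd_step(weights: List[float], grads: List[float],
             importance: List[float], lr: float = 0.1) -> List[float]:
    """One plain SGD step that uses the soft-masked gradients."""
    masked = soft_mask_gradients(grads, importance)
    return [w - lr * g for w, g in zip(weights, masked)]


if __name__ == "__main__":
    # A unit with importance 1.0 is effectively frozen; importance 0.0 trains freely.
    print(sgd_step([1.0, 1.0], [2.0, 2.0], [0.0, 1.0], lr=0.5))
```

Unlike a hard freeze, this kind of mask lets moderately important units still adapt, only more slowly, which is one plausible reading of how a soft mask could protect versatility without blocking speciality gains.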

Paper Structure (74 sections, 13 equations, 10 figures, 56 tables, 1 algorithm)

Introduction
Related Work
CF in LLM
Key Components in Transformer
The Framework
Task Formulation
Backbone Architecture
Coarse-grained Level
Fine-grained Level
Computing Importance of Units
Fine-SoftMask Mechanism
Experiment
Datasets and Experiment Settings
Datasets
Experimental Setting
...and 59 more sections

Figures (10)

Figure 1: An illustration of our objective: achieving effective speciality without significantly compromising versatility.
Figure 2: An illustration of our CoFiTune framework. $N$ denotes the number of layers. At the coarse-grained level, we pinpoint the module (e.g., FFN) within a defined layer range (e.g., 10th - 20th layers) that gains speciality effectively without harming versatility much. At the fine-grained level, we selectively update the parameters within the region identified at the coarse-grained level and leverage $\mathbf{I}_m$ to control their gradient flow.
Figure 3: The exploration process for the Finance task on the 13B model. $N$ denotes the number of layers; in this case, $N=40$. For simplicity, we denote each model fine-tuned in our exploration as "layer range - module", e.g., the model fine-tuned with the FFN module within the layer range $(10, 20]$ is denoted as "$(10, 20]$ - FFN".
Figure 4: Spec. scores for Finance and Math tasks under the 13B model across modules trained in all layers.
Figure 5: An illustration of our speculation on the process of information forwarding.
...and 5 more figures
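The "layer range - module" notation in the Figure 3 caption can be made concrete with a short sketch of how a coarse-grained selection might pick trainable parameters. It makes several assumptions that are not from the paper: layers are indexed from 1, the range $(10, 20]$ excludes layer 10 and includes layer 20, and parameters follow a hypothetical naming scheme `layers.<index>.<module>.<name>`. The helpers `select_trainable` and `label` are illustrative names only.

```python
import re
from typing import Iterable, List

# Hypothetical parameter naming scheme: "layers.<index>.<module>.<name>".
_PARAM_PATTERN = re.compile(r"^layers\.(\d+)\.(\w+)\.")


def select_trainable(names: Iterable[str], low: int, high: int, module: str) -> List[str]:
    """Return parameter names of `module` in the half-open layer range (low, high]."""
    selected = []
    for name in names:
        match = _PARAM_PATTERN.match(name)
        if match is None:
            continue  # e.g. embeddings or the output head stay frozen
        layer, found_module = int(match.group(1)), match.group(2)
        if low < layer <= high and found_module == module:
            selected.append(name)
    return selected


def label(low: int, high: int, module: str) -> str:
    """Format a configuration in the caption's "layer range - module" style."""
    return f"({low}, {high}] - {module}"


if __name__ == "__main__":
    # Toy 40-layer model with two modules per layer (names are illustrative).
    names = [f"layers.{i}.{m}.weight" for i in range(1, 41) for m in ("mha", "ffn")]
    names.append("embed.weight")
    picked = select_trainable(names, 10, 20, "ffn")
    print(label(10, 20, "FFN"), "selects", len(picked), "parameter tensors")
```

Everything outside the returned list would be kept frozen, matching the caption's description of fine-tuning a single module within one layer range.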
