Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing

Yang Tang, Ruijie Liu, Yifan Wang, Shiyu Li, Xi Chen

TL;DR

This work tackles catastrophic forgetting and the heavy cost of data-mixed fine-tuning in large language models by proposing Dynamic Boosted Annealing (DBA). DBA decouples general-domain learning from domain-specific adaptation via Global Gradient Boosted learning (GGB), Dynamic Correction (DC), and Annealing Learning (AL), using a fixed global gradient ${\hat{g}_G}$ and a gradient similarity-based update strategy. Empirical results across finance, medicine, law, and News QA—on multiple base models—show DBA provides superior joint performance (domain plus general) while cutting GPU hours by about 91% compared with vanilla fine-tuning. The approach offers a practical, scalable pathway to domain-specific fine-tuning without repeated data-mixing experiments, and the authors provide open-source tooling for easy adoption. Overall, DBA demonstrates a robust balance between specialization and generality, with strong potential for broad deployment in domain-adaptive fine-tuning scenarios.

Abstract

Fine-tuning large language models (LLMs) has proven highly effective. However, vanilla fine-tuning methods often require an intricate data mixture and repeated experiments to reach optimal generalization. To address these challenges and streamline the training process, we propose an efficient and universal solution, Dynamic Boosted Annealing (DBA). We obtain a global gradient through zero-learning-rate training on general data, which is subsequently used for gradient boosting and dynamic training-step correction during domain training. Combined with annealing learning, this yields a fine-tuning pipeline that relies solely on domain data without collapse. Evaluated on both general and domain-specific performance across multiple tasks on several popular base models, DBA achieves an average improvement of 5.8% in joint performance over vanilla fine-tuning. Furthermore, since general data is no longer involved in annealing, the repeated experiments caused by data mixing are also eliminated. According to our tests, DBA can reduce GPU hours by 91.0% compared to the vanilla method.
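The "zero-learning-rate training" step in the abstract can be illustrated with a minimal sketch. It assumes the global gradient is simply the average of mini-batch gradients collected while the parameters stay frozen; the abstract does not give the exact estimator, so the averaging rule, the toy linear model, and the names `linear_mse_grad` and `estimate_global_gradient` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Batch = Sequence[Tuple[Vector, float]]  # (features, target) pairs


def linear_mse_grad(w: Vector, batch: Batch) -> Vector:
    """Gradient of mean squared error for the toy model y_hat = w . x."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(batch)
    return grad


def estimate_global_gradient(
    w: Vector,
    batches: Sequence[Batch],
    grad_fn: Callable[[Vector, Batch], Vector],
) -> Vector:
    """Run a 'training' pass with learning rate 0 and average the gradients.

    Assumption: the global gradient is the plain mean over batches.
    """
    lr = 0.0  # zero learning rate: the update below never moves the weights
    params = list(w)
    total = [0.0] * len(w)
    for batch in batches:
        g = grad_fn(params, batch)
        total = [t + gi for t, gi in zip(total, g)]
        params = [p - lr * gi for p, gi in zip(params, g)]  # no-op update
    return [t / len(batches) for t in total]


# Toy "general data": two single-example batches.
general_batches = [
    [([1.0, 0.0], 1.0)],
    [([0.0, 1.0], 2.0)],
]
g_global = estimate_global_gradient([0.0, 0.0], general_batches, linear_mse_grad)
```

Because the weights never change, this pass needs to run only once per base model, which is consistent with the abstract's description of a pipeline that later relies solely on domain data.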

Paper Structure

This paper contains 21 sections, 13 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Comparison between vanilla fine-tuning and DBA. [*] marks the part that users need to perform in SFT.
  • Figure 2: Qualitative and quantitative analysis results for the general and specific domains.
  • Figure 3: Overview of Dynamic Boosted Annealing. Our approach consists of two stages. In the first stage, the global gradient is estimated in the general domain through zero-learning-rate learning, which serves as an independent preprocessing stage. In the second stage, the fine-tuning step, the global gradient boosts the specific gradient to preserve general capability, while the similarity between the global and specific gradients adaptively determines the parameter update magnitude. A learning rate with an annealing strategy suppresses degradation.
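The second stage described in the Figure 3 caption can be sketched roughly as follows. The caption does not state the formulas, so every specific choice here is an assumption made for illustration: the boost is an additive term weighted by a hypothetical `boost` factor, the similarity is cosine similarity mapped to a scale in [0, 1], and the annealing schedule is a linear decay. None of these should be read as the authors' actual definitions.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def annealed_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Placeholder annealing schedule: linear decay from base_lr to 0."""
    return base_lr * (1.0 - step / total_steps)


def dba_style_step(
    w: Vector,
    g_domain: Vector,
    g_global: Vector,
    base_lr: float,
    step: int,
    total_steps: int,
    boost: float = 1.0,
) -> Vector:
    """One illustrative update using only a domain gradient and a fixed global one."""
    # Assumed correction rule: agreement between the gradients enlarges the step,
    # disagreement shrinks it (scale = 1 when aligned, 0 when opposed).
    scale = 0.5 * (1.0 + cosine_similarity(g_domain, g_global))
    # Assumed boosting rule: add the fixed global gradient to the domain gradient.
    boosted = [gd + boost * gg for gd, gg in zip(g_domain, g_global)]
    lr = annealed_lr(base_lr, step, total_steps)
    return [wi - lr * scale * bi for wi, bi in zip(w, boosted)]


# Aligned gradients: full-size boosted step.
w_aligned = dba_style_step([1.0, 1.0], [1.0, 0.0], [1.0, 0.0], 0.5, 0, 10)
# Opposed gradients: the similarity scale drops to zero and the weights stay put.
w_opposed = dba_style_step([1.0, 1.0], [1.0, 0.0], [-1.0, 0.0], 0.5, 0, 10)
```

The point of the sketch is the data flow: the global gradient is a constant input computed beforehand, so each fine-tuning step touches only domain data.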