ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning

Jinyang Zhang, Yue Fang, Hongxin Ding, Weibin Liao, Muyang Ye, Xu Chu, Junfeng Zhao, Yasha Wang

TL;DR

The paper tackles catastrophic forgetting in domain-adaptive continual pretraining of large language models. It introduces ADEPT, a two-stage method that first performs general-competence guided selective layer expansion and then applies adaptive unit-wise decoupled tuning with asymmetric learning rates. On mathematics and medicine benchmarks, ADEPT outperforms full-parameter continual pretraining by up to $5.58\%$ on target-domain tasks and $5.76\%$ on general benchmarks while tuning only $15\%$ of parameters and using less than half the training time; ablations and theoretical bounds support these results. The approach shows that exploiting functional specialization and targeted parameter decoupling can yield robust, efficient domain adaptation with strong knowledge retention. The authors provide open-source code for reproducibility.

Abstract

Conventional continual pretraining (CPT) for large language model (LLM) domain adaptation often suffers from catastrophic forgetting and limited domain capacity. Existing strategies adopt layer expansion, introducing additional trainable parameters to accommodate new knowledge. However, uniform expansion and uniform updates still entangle general and domain learning, undermining the effectiveness of these strategies. Our pilot studies reveal that LLMs exhibit functional specialization, where layers and units differentially encode general-critical capabilities, suggesting that parameter expansion and optimization should be function-aware. We therefore propose ADEPT, Adaptive Expansion and Dynamic Decoupled Tuning for continual pretraining, a two-stage framework for domain-adaptive CPT. ADEPT first performs General-Competence Guided Selective Layer Expansion, duplicating the layers least critical for the general domain to increase representational capacity while minimizing interference with general knowledge. It then applies Adaptive Unit-Wise Decoupled Tuning, disentangling parameter units within expanded layers according to their general-domain importance and assigning asymmetric learning rates to balance knowledge injection and retention. Experiments on mathematical and medical benchmarks show that ADEPT outperforms full-parameter CPT by up to 5.76% on the general domain and 5.58% on the target domain with only 15% of parameters tuned and less than 50% of the training time. Ablation studies, theoretical analysis, and extended investigations further demonstrate the necessity of targeted expansion and decoupled optimization, providing new principles for efficient and robust domain-adaptive CPT. Our code is open-sourced at https://github.com/PuppyKnightUniversity/ADEPT
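
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the importance scores are assumed to be given, and the function names, `protect_ratio`, `low_lr_scale`, and the rule that more general-critical units receive the smaller learning rate are all hypothetical choices made for illustration.

```python
from typing import List


def select_layers_to_expand(layer_importance: List[float], k: int) -> List[int]:
    """Return the indices of the k layers with the lowest general-domain importance.

    These are the candidates for duplication (stage 1 in the abstract).
    """
    order = sorted(range(len(layer_importance)), key=lambda i: layer_importance[i])
    return sorted(order[:k])


def assign_unit_learning_rates(
    unit_importance: List[float],
    base_lr: float,
    protect_ratio: float,
    low_lr_scale: float,
) -> List[float]:
    """Assign one learning rate per parameter unit of an expanded layer (stage 2).

    Assumption: the top `protect_ratio` fraction of units by general-domain
    importance is updated with a scaled-down learning rate, while the
    remaining units use `base_lr`.
    """
    n_protect = int(len(unit_importance) * protect_ratio)
    ranked = sorted(
        range(len(unit_importance)),
        key=lambda i: unit_importance[i],
        reverse=True,
    )
    protected = set(ranked[:n_protect])
    return [
        base_lr * low_lr_scale if i in protected else base_lr
        for i in range(len(unit_importance))
    ]


if __name__ == "__main__":
    # Hypothetical importance scores for a 5-layer model.
    layers = select_layers_to_expand([0.9, 0.2, 0.7, 0.1, 0.5], k=2)
    print("layers to duplicate:", layers)

    # Hypothetical importance scores for 4 units inside one expanded layer.
    lrs = assign_unit_learning_rates(
        [0.8, 0.1, 0.4, 0.9], base_lr=1e-4, protect_ratio=0.5, low_lr_scale=0.1
    )
    print("per-unit learning rates:", lrs)
```

In a real training setup, the per-unit rates would typically be passed to the optimizer as separate parameter groups; how ADEPT computes importance and updates it during training is not specified in this summary.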

Paper Structure

This paper contains 44 sections, 1 theorem, 41 equations, 15 figures, 12 tables.

Key Result

Theorem F.1

Let $S \subseteq [L]$ be the set of layers selected for expansion/adaptation, and let $G(S)$ denote the source-domain generalization gap after adaptation. Under function-preserving initialization, limited adaptation steps, and an $L$-Lipschitz and $\beta$-smooth loss, $G(S)$ admits an upper bound in terms of layer importance, where $C$ is a constant depending on the learning rate, steps, loss smoothness, and initialization ...
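
The theorem assumes function-preserving initialization. This summary does not say how ADEPT achieves it, so the sketch below shows only one common construction, offered as an assumption: a duplicated residual block whose output projection is zeroed leaves its input unchanged at initialization. The toy block structure and the names `residual_block` and `duplicate_function_preserving` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def residual_block(x: List[float], w_in: Matrix, w_out: Matrix) -> List[float]:
    """Toy residual block: y = x + W_out @ relu(W_in @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    update = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return [xi + ui for xi, ui in zip(x, update)]


def duplicate_function_preserving(w_in: Matrix, w_out: Matrix) -> Tuple[Matrix, Matrix]:
    """Copy a block's weights but zero the output projection.

    With W_out = 0 the residual update vanishes, so the inserted block acts
    as the identity and the expanded model computes the same function as
    the original model before any training step.
    """
    new_w_in = [row[:] for row in w_in]
    new_w_out = [[0.0] * len(row) for row in w_out]
    return new_w_in, new_w_out


if __name__ == "__main__":
    x = [1.0, -2.0]
    w_in = [[0.5, 0.1], [-0.3, 0.2], [0.0, 1.0]]  # 3 hidden units, 2 inputs
    w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # 2 outputs, 3 hidden units

    new_w_in, new_w_out = duplicate_function_preserving(w_in, w_out)
    # The duplicated block is the identity at initialization.
    print(residual_block(x, new_w_in, new_w_out) == x)
```

Because the inserted block starts as the identity, any change in general-domain behavior comes only from the subsequent adaptation steps, which is the setting the bound addresses.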

Figures (15)

  • Figure 1: Illustration of the core idea of ADEPT. Target-domain extensions are applied to the regions least important for the general domain, minimizing catastrophic forgetting. Asymmetric learning rates are applied to parameter subsets for targeted knowledge injection.
  • Figure 2: Layer- and unit-level importance distribution of the Qwen3 family. The vertical axis corresponds to different layers, while the horizontal axis denotes parameter units within each layer. Deeper blue indicates higher importance for preserving general-domain competencies.
  • Figure 3: Illustration of ADEPT.
  • Figure 4: Activation distribution analysis of Qwen3-8B.
  • Figure 5: Token distribution shifts across domains. Word cloud visualizations of shifted tokens reveal that ADEPT achieves highly focused alignment, with most changes concentrated on domain-specific terminology.
  • ...and 10 more figures

Theorems & Definitions (2)

  • Theorem F.1: Upper Bound on Generalization Gap by Layer Importance
  • proof