Parameter Importance-Driven Continual Learning for Foundation Models

Lingxiang Wang, Hainan Zhang, Zhiming Zheng

TL;DR

The paper tackles catastrophic forgetting during domain-specific post-training of large foundation models. It introduces PIECE, a parameter-importance-driven continual enhancement method that updates only a tiny fraction of parameters (0.1%) per task, guided by two estimators: PIECE-F, based on Fisher information, and PIECE-S, based on a second-order normalization that fuses gradient and curvature signals. PIECE needs neither historical training data nor architectural changes, and it yields state-of-the-art continual learning performance across multiple language and multimodal models while preserving core capabilities such as programming and image captioning. The approach demonstrates robust, scalable domain adaptation with strong transfer and minimal forgetting, highlighting a practical path to sustainable continual learning in large, real-world models.

Abstract

Domain-specific post-training often causes catastrophic forgetting, making foundation models lose their general reasoning ability and limiting their adaptability to dynamic real-world environments. Preserving general capabilities while acquiring downstream domain knowledge is a central challenge for large language and multimodal models. Traditional continual learning methods, such as regularization, replay and architectural isolation, suffer from poor downstream performance, reliance on inaccessible historical data, or additional parameter overhead. While recent parameter-efficient tuning (PET) methods can alleviate forgetting, their effectiveness strongly depends on the choice of parameters and update strategies. In this paper, we introduce PIECE, a Parameter Importance Estimation-based Continual Enhancement method that preserves general ability while efficiently learning domain knowledge without accessing prior training data or increasing model parameters. PIECE selectively updates only 0.1% of core parameters most relevant to new tasks, guided by two importance estimators: PIECE-F based on Fisher Information, and PIECE-S based on a second-order normalization that combines gradient and curvature information. Experiments across three language models and two multimodal models show that PIECE maintains general capabilities and achieves state-of-the-art continual learning performance across diverse downstream tasks. Our results highlight a practical path to scalable, domain-adaptive foundation models without catastrophic forgetting.
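The Fisher-based selection mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a four-weight logistic-regression model, uses the standard empirical diagonal Fisher (mean squared per-example gradient of the log-likelihood) as the importance score, and keeps the top 25% of parameters so the toy mask is non-trivial (the paper reports 0.1%). The names `diagonal_fisher` and `top_k_mask` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def diagonal_fisher(weights, examples):
    """Empirical diagonal Fisher: mean over examples of the squared
    per-example gradient of the log-likelihood, one value per parameter."""
    fisher = [0.0] * len(weights)
    for features, label in examples:
        p = sigmoid(sum(w * x for w, x in zip(weights, features)))
        # For logistic regression, d log p(label | features) / d w_i = (label - p) * x_i.
        for i, x in enumerate(features):
            g = (label - p) * x
            fisher[i] += g * g
    return [f / len(examples) for f in fisher]

def top_k_mask(scores, fraction):
    """Binary mask keeping the highest-scoring `fraction` of parameters (at least one)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]

# Toy data (assumed): feature 0 varies strongly, feature 3 is always zero.
examples = [
    ([2.0, 0.5, 0.1, 0.0], 1),
    ([-1.5, 0.4, 0.2, 0.0], 0),
    ([1.0, -0.3, 0.1, 0.0], 1),
    ([-2.5, 0.2, -0.1, 0.0], 0),
]
weights = [0.0, 0.0, 0.0, 0.0]

fisher = diagonal_fisher(weights, examples)
mask = top_k_mask(fisher, fraction=0.25)
print("importance:", fisher)
print("mask:", mask)
```

In this toy setting the weight attached to the most variable feature receives the largest score and is the only one selected; a weight whose feature is always zero scores zero and would stay frozen.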

Paper Structure

This paper contains 13 sections, 31 equations, 19 figures, 4 tables, 1 algorithm.

Figures (19)

  • Figure 1: (a) Average downstream-task scores and (b) HumanEval Programming ability (Pass@K, K=1) of Llama3-8B on the TRACE benchmark as continual learning tasks increase. PIECE consistently outperforms full fine-tuning (SeqFT), regularization (EWC, GEM, LwF), replay (Replay, Replay-online), and PET (SeqLoRA, O-LoRA, LayerNorm, MIGU) baselines in both downstream performance and capability retention.
  • Figure 2: Illustration of PIECE. During parameter updates, PIECE performs the standard ① forward and ② backward steps, but before ④ updating, it ③ applies a fixed gradient mask (computed from PIECE-F/S parameter importance) to protect most parameters and update only the top-k task-relevant ones.
  • Figure 3: Distribution of critical parameters identified by PIECE-F and PIECE-S across different tasks.
  • Figure 4: Visualization of intermediate and upper layer task representations comparing the base model, full fine-tuning, and PIECE.
  • Figure 5: Parameter overlap across tasks.
  • ...and 14 more figures
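
The masked update described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the paper's code: it assumes plain SGD on a flat list of parameters, a hand-written gradient standing in for the backward pass, and a hypothetical precomputed binary mask. The function name `masked_sgd_step` is an assumption.

```python
def masked_sgd_step(params, grads, mask, lr):
    """Zero the gradients of protected parameters with a fixed binary mask,
    then take a plain SGD step on what remains."""
    masked_grads = [g * m for g, m in zip(grads, mask)]
    return [p - lr * g for p, g in zip(params, masked_grads)]

params = [1.0, -2.0, 0.5, 3.0]
grads = [0.4, 0.1, -0.2, 0.3]  # stand-in for the result of the backward pass
mask = [1, 0, 0, 0]            # hypothetical fixed mask: only parameter 0 is trainable

new_params = masked_sgd_step(params, grads, mask, lr=0.5)
print(new_params)
```

Because the mask is fixed before training, the protected parameters keep their original values at every step, no matter how large their gradients are.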