HiDe-PET: Continual Learning via Hierarchical Decomposition of Parameter-Efficient Tuning

Liyuan Wang, Jingyi Xie, Xingxing Zhang, Hang Su, Jun Zhu

TL;DR

The paper addresses catastrophic forgetting in continual learning (CL) with pre-trained models (PTMs) and parameter-efficient tuning (PET) by introducing HiDe-PET, a framework that decomposes the CL objective into three hierarchical components: within-task prediction (WTP), task-identity inference (TII), and task-adaptive prediction (TAP). HiDe-PET optimizes WTP with task-specific PET modules, improves TII with task-shared PET and distribution recovery, and uses representation recovery for TAP, enabling knowledge transfer while preserving pre-trained knowledge. The authors provide theoretical results linking the decomposed components to CL performance and to out-of-distribution (OOD) detection, and show that LoRA/Adapter-based PET generally outperforms prompt-based PET across diverse PTMs and tasks. Empirically, HiDe-PET achieves higher final average accuracy (FAA) and cumulative average accuracy (CAA) than the compared baselines on four CL benchmarks across three pre-trained checkpoints, and adaptive knowledge accumulation further improves performance under distribution shifts. Overall, the approach offers a general way to deploy CL with frozen backbones and lightweight PET on dynamic task streams.

Abstract

The deployment of pre-trained models (PTMs) has greatly advanced the field of continual learning (CL), enabling positive knowledge transfer and resilience to catastrophic forgetting. To sustain these advantages for sequentially arriving tasks, a promising direction involves keeping the pre-trained backbone frozen while employing parameter-efficient tuning (PET) techniques to instruct representation learning. Despite the popularity of Prompt-based PET for CL, its empirical design often leads to sub-optimal performance in our evaluation of different PTMs and target tasks. To this end, we propose a unified framework for CL with PTMs and PET that provides both theoretical and empirical advancements. We first perform an in-depth theoretical analysis of the CL objective in a pre-training context, decomposing it into hierarchical components namely within-task prediction, task-identity inference and task-adaptive prediction. We then present Hierarchical Decomposition PET (HiDe-PET), an innovative approach that explicitly optimizes the decomposed objective through incorporating task-specific and task-shared knowledge via mainstream PET techniques along with efficient recovery of pre-trained representations. Leveraging this framework, we delve into the distinct impacts of implementation strategy, PET technique and PET architecture, as well as adaptive knowledge accumulation amidst pronounced distribution changes. Finally, across various CL scenarios, our approach demonstrates remarkably superior performance over a broad spectrum of recent strong baselines.
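
The hierarchical decomposition named in the abstract can be illustrated with the chain rule of probability. The sketch below uses assumed notation that does not appear in this summary: $\mathcal{X}_{i,j}$ for the domain of class $j$ in task $i$, $\mathcal{X}_{i}$ for the domain of task $i$ (so $\mathcal{X}_{i,j} \subseteq \mathcal{X}_{i}$), $\mathcal{D}$ for the training data, and $\theta$ for the model parameters. It shows only why within-task prediction (WTP) and task-identity inference (TII) multiply; it is not the paper's full statement, which also involves task-adaptive prediction (TAP).

$$
P(\boldsymbol{x} \in \mathcal{X}_{i,j} \mid \mathcal{D}, \theta)
= \underbrace{P(\boldsymbol{x} \in \mathcal{X}_{i,j} \mid \boldsymbol{x} \in \mathcal{X}_{i}, \mathcal{D}, \theta)}_{\text{WTP}}
\cdot
\underbrace{P(\boldsymbol{x} \in \mathcal{X}_{i} \mid \mathcal{D}, \theta)}_{\text{TII}}
$$

Under this reading, a class prediction is correct only if the task is identified correctly and the class is then resolved correctly within that task, which is why the two terms are optimized as separate components.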

Paper Structure

This paper contains 24 sections, 9 theorems, 53 equations, 7 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

For continual learning (CL) in a pre-training context, if $\mathbb{E}_{\boldsymbol{x}} [{H}_{\rm{WTP}}(\boldsymbol{x})] \leq \delta$, $\mathbb{E}_{\boldsymbol{x}} [{H}_{\rm{TII}}(\boldsymbol{x})] \leq \epsilon$, and $\mathbb{E}_{\boldsymbol{x}} [{H}_{\rm{TAP}}(\boldsymbol{x})] \leq \eta$, we have th

Figures (7)

  • Figure 1: Summary of CL performance with different PET techniques. We compare our HiDe-PET, our preliminary version (wang2023hierarchical), and LAE (gao2023unified), and report the final average accuracy (FAA) over three pre-trained checkpoints and four CL benchmarks.
  • Figure 2: Implementation of PET techniques for representation learning. These PET techniques all amount to modulating the (intermediate) representations of the backbone and ensure lightweight implementations.
  • Figure 3: Illustration of HiDe-PET. See Fig. 2 for detailed implementations of PET techniques and the frozen transformer backbone. The example shows 2 tasks with 2 classes each. WTP aims to classify the 2 classes within each task well, optimized by the PET ensemble of task-specific parameters. TII aims to select the appropriate one of the 2 tasks, optimized by the task-shared parameters and the recovery of uninstructed representations. TAP aims to classify all 4 classes well, optimized by the recovery of instructed representations upon WTP and TII.
  • Figure 4: Inference of HiDe-PET during the testing phase. HiDe-PET first employs the task-shared parameters and the auxiliary output layer to infer task identity, and then employs the corresponding task-specific parameters and the output layer to obtain final prediction.
  • Figure 5: Adaptive knowledge accumulation. HiDe-PET employs OOD detection to decide whether to expand a new set of parameters or to retrieve a previously learned set of parameters. Such parameters are further specified into LoRA-based PET to update the backbone.
  • ...and 2 more figures
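
The two-stage inference described in the Figure 4 caption can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: every name here (`hierarchical_predict`, `shared_adapter`, `aux_head`, `task_adapters`, `output_head`) is hypothetical, and each "adapter" is a plain matrix standing in for a frozen backbone combined with PET parameters. The toy setup mirrors the Figure 3 example of 2 tasks with 2 classes each.

```python
import numpy as np


def hierarchical_predict(x, shared_adapter, aux_head, task_adapters, output_head):
    """Infer the task identity first, then predict with that task's parameters.

    All arguments are hypothetical stand-ins (plain matrices), used here only
    to show the control flow of the two stages.
    """
    # Stage 1: task-shared features feed an auxiliary head over task identities.
    shared_features = x @ shared_adapter
    task_id = int(np.argmax(shared_features @ aux_head))

    # Stage 2: features from the selected task-specific parameters feed the
    # output layer, which scores all classes seen so far.
    task_features = x @ task_adapters[task_id]
    class_id = int(np.argmax(task_features @ output_head))
    return task_id, class_id


# Toy setup: 2 tasks, 2 classes each, 2-dimensional inputs.
shared_adapter = np.eye(2)
aux_head = np.eye(2)                       # column t scores task t
task_adapters = [np.eye(2), 2.0 * np.eye(2)]
output_head = np.array([[0.1, 0.9, 0.0, 0.0],
                        [0.0, 0.0, 0.8, 0.2]])  # columns are the 4 classes

pred0 = hierarchical_predict(np.array([1.0, 0.0]), shared_adapter, aux_head,
                             task_adapters, output_head)
pred1 = hierarchical_predict(np.array([0.0, 1.0]), shared_adapter, aux_head,
                             task_adapters, output_head)
print(pred0, pred1)
```

In this sketch an error in stage 1 selects the wrong task-specific parameters for stage 2, which illustrates why task-identity inference is treated as its own component of the objective.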

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • Theorem 9