REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning

Sungho Jeon, Xinyue Ma, Kwang In Kim, Myeongjae Jeon

TL;DR

Prompt-based rehearsal-free continual learning methods are too resource-intensive for many edge devices. REP mitigates this with a lightweight surrogate model for prompt selection and two adaptive update techniques, adaptive token merging and adaptive layer dropping, which reduce compute and memory while preserving task-specific knowledge. The approach substantially reduces training time and memory across multiple datasets and ViT backbones with only marginal accuracy loss, and it extends to non-prompting methods and adapters. Overall, REP makes on-device continual learning with vision transformers more practical.

Abstract

Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP's superior resource efficiency over state-of-the-art rehearsal-free CL methods.
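
To make the prompt selection step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an L2P-style prompt pool in which each prompt has a key vector and the prompts whose keys are most similar to a query feature are selected by cosine similarity. The query is assumed to come from a small surrogate encoder and is mapped to the key width with a fixed random projection. All names, dimensions, and the random stand-in for the surrogate output are hypothetical.

```python
import math
import random


def random_projection(dim_in, dim_out, seed=0):
    """Build a fixed (untrained) Gaussian projection matrix of shape dim_out x dim_in."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, vec):
    """Matrix-vector product: map a surrogate feature to the key width."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def select_prompts(query, keys, top_k):
    """Return the indices of the top_k keys most similar to the query."""
    scores = [cosine(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical sizes, chosen only for illustration.
SURROGATE_DIM = 8   # feature width of the small surrogate encoder
KEY_DIM = 16        # width of the prompt keys
POOL_SIZE = 5       # number of prompts in the pool

rng = random.Random(1)
proj = random_projection(SURROGATE_DIM, KEY_DIM)
keys = [[rng.gauss(0.0, 1.0) for _ in range(KEY_DIM)] for _ in range(POOL_SIZE)]

# Stand-in for the surrogate model's output on one input sample.
surrogate_feature = [rng.gauss(0.0, 1.0) for _ in range(SURROGATE_DIM)]

query = project(proj, surrogate_feature)
chosen = select_prompts(query, keys, top_k=2)
print("selected prompt indices:", chosen)
```

In this sketch the expensive backbone is never run to compute the query; only the prompts at the selected indices would be passed on to the larger model for training.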

Paper Structure

This paper contains 37 sections, 18 equations, 7 figures, 23 tables, and 2 algorithms.

Figures (7)

  • Figure 1: Overview of the proposed resource-efficient prompting (REP) algorithm for rehearsal-free CL. REP calculates query features from input samples using a lightweight surrogate model (e.g., ViT-Ti) and random projections to swiftly extract prompts from the prompt pool. These prompts are then inserted into a main backbone model (e.g., ViT-L) for training, which prioritizes model accuracy.
  • Figure 2: Mean attention distances for frozen blocks across (a) layers and (b, c) attention heads, measured on the first task of Split ImageNet-R. Panels (a) and (b) use L2P with ViT-L; panel (c) uses DualPrompt with ViT-B.
  • Figure 3: Norm of the gradient with respect to the prompt during training on Split ImageNet-R (10 tasks) when AToM and ToMe (conventional token merging) are applied to L2P with ViT-L.
  • Figure 4: Comparison of layer-dropping strategies using L2P with the ViT-L backbone on Split ImageNet-R (10 tasks). Bars show GPU time and markers show final average accuracy.
  • Figure 5: Cost-accuracy trade-offs of various ViT- and CNN-based methods under three memory budgets: up to 1 GB, 1–4 GB, and 4–8 GB. The first row shows the memory breakdown of each method. The second, third, and fourth rows show experiments on Split CIFAR-100, Split ImageNet-R, and Split PlantDisease, respectively. ViT-based methods consistently outperform CNN-based methods by a large margin.
  • ...and 2 more figures