GRID: Scalable Task-Agnostic Prompt-Based Continual Learning for Language Models

Anushka Tiwari, Sayantan Pal, Rohini K. Srihari, Kaiyi Ji

TL;DR

GRID tackles latent forgetting and unbounded prompt growth in task-agnostic prompt-based continual learning for LLMs by combining constrained decoding with gradient-guided prompt compression. It uses representative input sampling and task identification to stabilize decoding, and a gradient-based mechanism to prune and aggregate prompts so that the prompt pool stays compact yet informative. Results on long-sequence and negative-transfer benchmarks show substantial improvements in backward transfer and memory efficiency, with competitive forward transfer and scalability to large models. The approach supports robust, privacy-conscious continual learning without explicit task IDs, which makes it practical for real-world deployment of language models.

Abstract

Prompt-based continual learning (CL) provides a parameter-efficient approach for adapting large language models (LLMs) across task sequences. However, most existing methods rely on task-aware inference and maintain a growing set of task-specific prompts, which introduces two major challenges: (1) severe performance degradation on earlier tasks under task-agnostic inference, and (2) limited scalability due to prompt memory accumulation as task sequences grow. In this paper, we present GRID, a unified framework designed to address these challenges. GRID incorporates a decoding mechanism that enhances backward transfer by leveraging representative inputs, automatic task identification, and constrained decoding. Furthermore, it employs a gradient-guided prompt selection strategy to compress less informative prompts into a single aggregated representation, ensuring scalable and memory-efficient continual learning. Extensive experiments on long-sequence and negative transfer benchmarks show that GRID improves average accuracy and backward transfer, achieves competitive forward transfer, and substantially reduces prompt memory usage.
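
As a rough illustration of the constrained decoding idea mentioned above, the sketch below restricts a prediction to the label set of the identified task. It is not the authors' implementation: the function name `constrained_argmax`, the dictionary of per-candidate scores, and the example labels are all hypothetical, and a real system would apply the constraint to token-level logits during generation.

```python
import math


def constrained_argmax(scores, allowed_labels):
    """Return the highest-scoring label among allowed_labels.

    scores: dict mapping candidate output string -> model score
            (hypothetical stand-in for a language model's output scores).
    allowed_labels: labels considered valid for the identified task.
    Returns None if no allowed label has a score.
    """
    best_label, best_score = None, -math.inf
    for label in allowed_labels:
        # Candidates the model never scored are treated as impossible.
        score = scores.get(label, -math.inf)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical scores: an unconstrained argmax would pick "entailment",
# which is not a valid answer for a sentiment task.
scores = {"positive": 1.2, "negative": 0.4, "entailment": 2.5}
sentiment_labels = ["positive", "negative"]
prediction = constrained_argmax(scores, sentiment_labels)
```

In this toy case the label from an earlier, unrelated task scores highest overall, but the constraint limits the choice to the identified task's label space.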

Paper Structure

This paper contains 27 sections, 5 equations, 28 figures, 19 tables, 2 algorithms.

Figures (28)

  • Figure 1: Overview of the proposed GRID framework. (S1) The model receives a stream of tasks with corresponding datasets. (S2) Representative samples are selected via clustering for each task, and task identification is performed to ensure consistent label formats. (S3) Gradient-based prompt selection is applied: prompts from the frozen prompt pool are ranked by their gradient norms with respect to the current task. (S4) The compressed prompt pool is used to train soft prompts for new tasks with the base model frozen. (S5) During inference, constrained decoding keeps predictions aligned with the identified task semantics.
  • Figure 2: Heatmaps of backward transfer scores on previous tasks for Order L1. (A) Progressive Prompts, (B) SHLPT, (C) GRID, and differences (D) C–A, (E) C–B.
  • Figure 3: Per-task BWT comparison between our method (blue) and the baseline (red) for Order L1. Positive bars indicate improved retention of prior tasks. Our method shows significant BWT gains on several tasks (e.g., copa, wic, yahoo), demonstrating its effectiveness in mitigating forgetting across diverse task types.
  • Figure 5: Per-task BWT comparison between our method (blue) and the baseline (red) for Order L2. Positive bars indicate improved retention of prior tasks.
  • Figure 7: Per-task BWT comparison between our method (blue) and the baseline (red) for Order L3. Positive bars indicate improved retention of prior tasks.
  • ...and 23 more figures
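
The gradient-based prompt selection step (S3) in Figure 1 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes one gradient norm per prompt has already been computed, that prompts with the largest norms are the ones to retain, and that the remaining prompts are merged by a plain element-wise mean. The function name `compress_prompt_pool` and the `keep` parameter are hypothetical.

```python
def compress_prompt_pool(prompts, grad_norms, keep):
    """Keep the `keep` prompts with the largest gradient norms and
    merge all remaining prompts into one averaged prompt.

    prompts: list of equal-length lists of floats (toy prompt embeddings).
    grad_norms: one precomputed gradient norm per prompt (assumed given).
    keep: number of prompts to retain unchanged (hypothetical parameter).
    """
    if len(prompts) != len(grad_norms):
        raise ValueError("prompts and grad_norms must have the same length")

    # Rank prompt indices by gradient norm, largest first.
    order = sorted(range(len(prompts)), key=lambda i: grad_norms[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve original pool order
    rest_idx = order[keep:]

    compressed = [prompts[i] for i in kept_idx]
    if rest_idx:
        dim = len(prompts[0])
        # Assumption: aggregate low-ranked prompts by element-wise mean.
        mean_prompt = [
            sum(prompts[i][d] for i in rest_idx) / len(rest_idx)
            for d in range(dim)
        ]
        compressed.append(mean_prompt)
    return compressed


# Toy pool of four 2-dimensional prompts with made-up gradient norms.
pool = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0], [2.0, 0.0]]
norms = [0.9, 0.1, 0.7, 0.3]
compressed = compress_prompt_pool(pool, norms, keep=2)
```

With `keep=2`, the pool shrinks from four prompts to three: the two highest-ranked prompts plus one aggregated prompt. The pool size is then bounded by `keep + 1` regardless of how many tasks have been seen.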

Theorems & Definitions (1)

  • Definition 1: Task-Agnostic Inference