HOP to the Next Tasks and Domains for Continual Learning in NLP

Umberto Michieli, Mete Ozay

TL;DR

This paper tackles continual learning (CL) in NLP across tasks and domains by introducing HOP, a framework that combines Adapter-BERT with high-order pooling of token embeddings and per-task auxiliary MLP heads. HOP computes multiple moments $m_1, m_2, \dots, m_p$ of the token distribution (where $m_1$ is the average, AVG) and concatenates them. This lets it capture tailed distribution shifts that a single [CLS] token misses, enabling knowledge transfer (KT) while mitigating catastrophic forgetting (CF). The approach supports both task-incremental and domain-incremental learning in a unified, parameter-efficient manner. It is evaluated on 4 NLP applications, 5 benchmarks, and 2 CL setups, where it shows favorable accuracy and KT/CF metrics at low compute overhead. Overall, HOP is a simple baseline that improves CL in NLP and can be added on top of existing methods at minimal additional cost.

Abstract

Continual Learning (CL) aims to learn a sequence of problems (i.e., tasks and domains) by transferring knowledge acquired on previous problems while avoiding forgetting past ones. Unlike previous approaches, which focused on CL for one NLP task or domain in a specific use case, this paper addresses a more general CL setting: learning from a sequence of problems in a single framework. Our method, HOP, makes it possible to hop across tasks and domains by addressing the CL problem along three directions: (i) we employ a set of adapters to generalize a large pre-trained model to unseen problems, (ii) we compute high-order moments over the distribution of embedded representations to distinguish independent and correlated statistics across different tasks and domains, and (iii) we process this enriched information with auxiliary heads specialized for each end problem. An extensive experimental campaign on 4 NLP applications, 5 benchmarks, and 2 CL setups demonstrates the effectiveness of HOP.
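
To make direction (ii) concrete, the following is a minimal NumPy sketch of high-order pooling over token embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `high_order_pool`, the padding `mask` argument, and the use of central moments for orders $k \geq 2$ are assumptions; the summary above only states that $m_1$ is the average and that the moments $m_1, \dots, m_p$ are concatenated.

```python
import numpy as np


def high_order_pool(tokens: np.ndarray, mask: np.ndarray, p: int = 3) -> np.ndarray:
    """Pool token embeddings into one vector of concatenated moments.

    tokens: (n_tokens, dim) embeddings of one sequence.
    mask:   (n_tokens,) with 1 for real tokens and 0 for padding (assumed input).
    p:      highest moment order to compute.
    Returns a vector of size p * dim: [m_1, m_2, ..., m_p].
    """
    valid = tokens[mask.astype(bool)]      # drop padded positions
    mean = valid.mean(axis=0)              # m_1: average pooling (AVG)
    moments = [mean]
    centered = valid - mean
    for k in range(2, p + 1):
        # Assumption: m_k is the k-th central moment per embedding dimension.
        moments.append((centered ** k).mean(axis=0))
    return np.concatenate(moments)


# Toy example: two real tokens and one padded token, embedding size 2.
tokens = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 0.0]])
mask = np.array([1, 1, 0])
pooled = high_order_pool(tokens, mask, p=3)
print(pooled.shape)  # (6,)
```

In a full pipeline the pooled vector would be passed to the auxiliary head of the current problem in place of the single [CLS] embedding. The pooled size grows linearly with `p`, so the head's input layer would have to be sized accordingly.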

Paper Structure (7 sections, 2 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: The proposed HOP framework. During the incremental step $T$, only orange modules are trained, while gray and green modules are frozen.
  • Figure 2: Accuracy matrix showing the main CL metrics used in this work. $\hat{\mathcal{S}}_t$ and ${\mathcal{S}}_t$ are the testing and training datasets at step $t$.
  • Figure 3: Per-problem accuracy ($mAcc_t$ and $mAcc_{t,\leq T}$) on the DSC small dataset for both TIL and DIL setups.
  • Figure 4: mAcc vs. training time per problem on the TIL setup. Optimal results are in the top-left corner.