Continual Contrastive Spoken Language Understanding

Umberto Cappellazzo, Enrico Fini, Muqiao Yang, Daniele Falavigna, Alessio Brutti, Bhiksha Raj

TL;DR

This work tackles catastrophic forgetting in end-to-end spoken language understanding (SLU) under a class-incremental learning setting. It introduces COCONUT, which combines experience replay with two contrastive losses: NSPT (negative-student positive-teacher), a supervised contrastive distillation loss that preserves the representations of past classes, and MM (multi-modal), which aligns the audio and text features of new classes. COCONUT improves consistently over the baselines on FSC and SLURP, and gains further when paired with knowledge distillation applied on the decoder side. By preserving past knowledge while refining new-task representations, COCONUT offers a practical route to continually learned SLU systems.

Abstract

Recently, neural networks have shown impressive progress across diverse fields, with speech processing being no exception. However, recent breakthroughs in this area require extensive offline training using large datasets and tremendous computing resources. Unfortunately, these models struggle to retain their previously acquired knowledge when learning new tasks continually, and retraining from scratch is almost always impractical. In this paper, we investigate the problem of learning sequence-to-sequence models for spoken language understanding in a class-incremental learning (CIL) setting and we propose COCONUT, a CIL method that relies on the combination of experience replay and contrastive learning. Through a modified version of the standard supervised contrastive loss applied only to the rehearsal samples, COCONUT preserves the learned representations by pulling samples from the same class closer and pushing the others away. Moreover, we leverage a multimodal contrastive loss that helps the model learn more discriminative representations of the new data by aligning audio and text features. We also investigate different contrastive designs to combine the strengths of the contrastive loss with teacher-student architectures used for distillation. Experiments on two established SLU datasets reveal the effectiveness of our proposed approach and significant improvements over the baselines. We also show that COCONUT can be combined with methods that operate on the decoder side of the model, resulting in further improvements in the metrics.
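
The multimodal contrastive loss mentioned in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a symmetric, class-aware InfoNCE-style objective over L2-normalized audio and text embeddings. The names `mm_loss` and `cross_modal_loss` and the `temperature` value are hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_modal_loss(anchors, targets, labels, temperature=0.1):
    """For each anchor in one modality, treat same-class embeddings of the
    other modality as positives and all other targets as negatives."""
    anchors = [l2_normalize(a) for a in anchors]
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all targets.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Index i is always a positive: the paired sample shares the label.
        positives = [j for j, y in enumerate(labels) if y == labels[i]]
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / len(anchors)


def mm_loss(audio, text, labels, temperature=0.1):
    """Assumed symmetric form: audio-to-text plus text-to-audio."""
    return (cross_modal_loss(audio, text, labels, temperature)
            + cross_modal_loss(text, audio, labels, temperature))
```

With toy two-dimensional embeddings, the loss is close to zero when each audio vector points in the same direction as the text vector of its class, and grows when the two modalities are mismatched.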

Paper Structure (22 sections, 7 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of COCONUT. It uses two contrastive learning-based losses. The NSPT (negative-student positive-teacher) loss is a supervised contrastive distillation loss that preserves the feature representations of the past classes for both audio and text samples. The positive and negative samples are computed with the teacher and student model, respectively. The MM (multi-modal) loss aims to align audio and text representations belonging to the same new class. COCONUT produces features that are more transferable and resilient to catastrophic forgetting.
  • Figure 2: Illustration of the NTPT loss and our proposed NSPT loss. Given an anchor sample from the current mini-batch, the NTPT loss computes both the negatives and the positives using the teacher model (dashed circles). The NSPT loss instead computes the positives with the teacher and the negatives with the student model (solid circles). Whereas the features obtained with the teacher are scattered and static (the teacher is frozen), those obtained with the student are more clustered and can be learned during the current task. Best viewed in color.
  • Figure 3: Left: the trend of the intent accuracy on the observed tasks for the FSC-6 setting. Right: the trend of the intent accuracy on the observed tasks for SLURP-6.
  • Figure 4: Left: the trend of the WER on the observed tasks for the FSC-6 setting. Right: the accuracy of COCONUT and other methods as a function of the memory size.
  • Figure 5: Computational cost analysis of various CIL methods for FSC-6 (left) and SLURP-6 (right).
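
The difference between the NTPT and NSPT losses described in Figure 2 can be sketched as follows. This is a minimal illustration under assumptions, not the paper's formulation: the anchor is taken from the student, every same-class sample (including the anchor's own teacher embedding) counts as a positive, and the function name, the `negatives_from` switch, and the temperature are hypothetical.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_distillation_loss(student, teacher, labels,
                                  negatives_from="student", temperature=0.1):
    """Supervised contrastive loss with positives taken from the teacher.

    negatives_from="student" mimics the NSPT idea (negatives from the
    trainable student); negatives_from="teacher" mimics NTPT (negatives
    from the frozen teacher).
    """
    if negatives_from not in ("student", "teacher"):
        raise ValueError("negatives_from must be 'student' or 'teacher'")
    student = [_normalize(v) for v in student]
    teacher = [_normalize(v) for v in teacher]
    negative_bank = student if negatives_from == "student" else teacher

    total = 0.0
    for i, anchor in enumerate(student):
        pos = [j for j, y in enumerate(labels) if y == labels[i]]
        neg = [j for j, y in enumerate(labels) if y != labels[i]]
        # Positives always come from the frozen teacher (assumption: the
        # anchor's own teacher embedding is included as a positive).
        pos_logits = [_dot(anchor, teacher[j]) / temperature for j in pos]
        neg_logits = [_dot(anchor, negative_bank[j]) / temperature for j in neg]
        all_logits = pos_logits + neg_logits
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(all_logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in all_logits))
        total += -sum(z - log_denom for z in pos_logits) / len(pos_logits)
    return total / len(student)
```

According to the abstract, the contrastive distillation loss is applied only to the rehearsal samples; the sketch leaves that selection to the caller, who would pass only rehearsal embeddings and their labels. When the student and teacher embeddings coincide, the two variants give the same value, which makes a convenient sanity check.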