DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes

Xilin Jiang, Yinghao Aaron Li, Nima Mesgarani

TL;DR

DeCoR tackles catastrophic forgetting in lifelong audio representation learning by distilling prior knowledge into the current model through predicting delayed codebook indices, avoiding storage of past data or teacher models. It constructs a delayed codebook at task boundaries and trains an index predictor to regularize the new model, achieving continual learning with minimal storage and computation. Evaluations on TAU Urban Acoustic Scenes show consistent improvements in final seen accuracy $A_T$ and reduced forgetting $F_T$ for both supervised and self-supervised setups, outperforming replay and standard distillation baselines and synergizing with SimCLR. The approach offers a lightweight, scalable solution for continual audio representation learning with potential applicability to other audio tasks and online settings.

Abstract

Lifelong audio feature extraction involves learning new sound classes incrementally, which is essential for adapting to new data distributions over time. However, optimizing the model only on new data can lead to catastrophic forgetting of previously learned tasks, which undermines the model's ability to perform well over the long term. This paper introduces a new approach to continual audio representation learning called DeCoR. Unlike other methods that store previous data, features, or models, DeCoR indirectly distills knowledge from an earlier model to the latest by predicting quantization indices from a delayed codebook. We demonstrate that DeCoR improves acoustic scene classification accuracy and integrates well with continual self-supervised representation learning. Our approach introduces minimal storage and computation overhead, making it a lightweight and efficient solution for continual learning.

Paper Structure

This paper contains 16 sections, 6 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Comparison of DeCoR with other continual learning methods. The arrows indicate computation, and the green boxes indicate extra storage. DeCoR is more efficient in both computation and storage. Replay-based methods require storing rehearsal audio and training on it. Model knowledge distillation requires additional space and computation to store and forward one or more past model checkpoints. Contrastive learning demands training on multiple augmented views of the same audio. In contrast, DeCoR only stores and predicts one quantization index per audio.
  • Figure 2: Graphical illustration of how DeCoR works with irrelevant model components omitted for clarity. The gray arrows correspond to actions taken during the INCREMENT step at the task boundary, where new task data is encoded, clustered, and indexed using the delayed codebook. The red arrows correspond to actions taken during the DISTILL step throughout the task, where the model is trained to predict the indices assigned earlier in the INCREMENT step. Notably, there is no direct connection between the current and previous models, and knowledge is distilled solely through index prediction. Past models and codebooks are depicted for illustration purposes, with only quantization indices being stored.
  • Figure 3: LEP and SLEP accuracy evaluated at the end of every task. The order from left to right is as follows: LEP for supervised training, LEP for self-supervised training, SLEP for supervised training, and SLEP for self-supervised training. We observe an improvement in $A_t$ with DeCoR starting from $t=2$. An exception is $t=2$ for 10-task supervised training: the model learns only one class at $t=1$, so classification is trivial and the knowledge distilled from $t=1$ is of no use.
  • Figure 4: Impact of the DeCoR codebook size $K$ and the number of predictor layers $L$ on the final LEP (top) and SLEP (bottom) accuracy and forgetting for 5-task supervised training. All combinations perform better than the Baseline.
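
The INCREMENT and DISTILL steps described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: k-means is assumed as the clustering method, cross-entropy as the index-prediction loss, and a single linear layer as the predictor (Figure 4 indicates the paper varies the number of predictor layers $L$). The names `kmeans`, `increment`, `distill_loss`, and the toy `encoder` are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering method)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(axis=1)

def increment(encoder, audio, k):
    """INCREMENT (at the task boundary): encode the new task's audio with the
    model as it stands, cluster the features, and keep one index per clip.
    The codebook itself is discarded; only the indices are stored."""
    features = encoder(audio)
    _, indices = kmeans(features, k)
    return indices

def distill_loss(features, predictor_w, indices):
    """DISTILL (throughout the task): cross-entropy between the predictor's
    logits and the stored indices (an assumed loss), used as a regularizer."""
    logits = features @ predictor_w
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(indices)), indices].mean()

# Toy usage with random stand-ins for audio features and encoder weights.
rng = np.random.default_rng(0)
audio = rng.normal(size=(64, 16))          # 64 clips, 16-dim inputs
w_enc = rng.normal(size=(16, 8))
encoder = lambda x: np.tanh(x @ w_enc)     # hypothetical encoder
K = 4                                      # codebook size
indices = increment(encoder, audio, K)     # stored: one integer per clip
predictor_w = np.zeros((8, K))             # untrained linear index predictor
loss = distill_loss(encoder(audio), predictor_w, indices)
```

With an untrained (all-zero) predictor, the loss equals $\log K$, the cross-entropy of a uniform guess over the $K$ codes. In training, this term would be added to the task loss so that the updated encoder keeps features from which the earlier indices remain predictable.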