Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation

Siddeshwar Raghavan, Gautham Vinod, Bruce Coburn, Fengqing Zhu

TL;DR

This work introduces the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets, and proposes a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention.

Abstract

Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, so audio and visual distributions evolve over time, which challenges existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual scenarios, establishing a foundation for lifelong audio-visual perception. Code is available at https://gitlab.com/viper-purdue/atlas (paper under review).

Keywords: Continual Learning, Audio-Visual Segmentation, Multi-Modal Learning
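
To make the phrase "audio-guided pre-fusion conditioning ... before cross-modal attention" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a FiLM-style per-channel scale and shift computed from a mean-pooled audio context, followed by single-head cross-attention in which the conditioned visual tokens are queries and the audio tokens are keys and values (as the paper's overview figure caption states). The function name, the pooling choice, the tanh-bounded scale, and all weight matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def audio_conditioned_cross_attention(visual, audio, w_scale, w_shift, w_q, w_k, w_v):
    """visual: (N, C) flattened visual tokens; audio: (T, D) audio tokens."""
    # Assumption: pool audio over time into one context vector.
    context = audio.mean(axis=0)                       # (D,)
    # Assumption: project the context to a per-channel scale and shift.
    scale = 1.0 + np.tanh(context @ w_scale)           # (C,)
    shift = context @ w_shift                          # (C,)
    conditioned = visual * scale + shift               # (N, C)
    # Cross-attention: conditioned visual queries, audio keys and values.
    q = conditioned @ w_q                              # (N, H)
    k = audio @ w_k                                    # (T, H)
    v = audio @ w_v                                    # (T, H)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (N, T)
    return conditioned, attn @ v, attn

# Toy usage with random inputs (all sizes are illustrative).
rng = np.random.default_rng(0)
N, C, T, D, H = 16, 8, 5, 6, 4
visual = rng.normal(size=(N, C))
audio = rng.normal(size=(T, D))
conditioned, fused, attn = audio_conditioned_cross_attention(
    visual, audio,
    rng.normal(size=(D, C)), rng.normal(size=(D, C)),
    rng.normal(size=(C, H)), rng.normal(size=(D, H)), rng.normal(size=(D, H)),
)
```

Conditioning before attention means the audio signal reweights visual channels first, so the queries entering cross-attention are already audio-aware; whether ATLAS uses this exact parameterization is not stated in the abstract.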

Paper Structure

This paper contains 19 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: An overview of the Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation.
  • Figure 2: Overview of ATLAS: The framework performs exemplar-free continual audio-visual segmentation using frozen encoder backbones with parameter-efficient LoRA adapters in the visual pathway. Audio-guided pre-fusion conditioning is followed by cross-attention, where conditioned visual features act as queries and audio features serve as keys and values. To mitigate catastrophic forgetting across tasks, our Low-Rank Anchoring (LRA) module dynamically restricts LoRA parameter drift. The fused representation is decoded for segmentation with an optional classification branch.
  • Figure 3: Qualitative AVS results from the top four methods in our CL-AVS setting: the input frame and the predicted binary mask for the middle frame are shown along with the ground-truth mask. (Left) SS-AVS CIL; (Right) MS-AVS TF-CL.
  • Figure 4: Forward Transfer vs. Forgetting: Trade-off between forward transfer and forgetting across models in the CL-AVS benchmark on the SS-AVS dataset under the TIL and DIL protocols.
  • Figure 5: Heatmap of MaxF scores on the SS-AVS test set under the CIL protocol with an 11–2 split, evaluated after each incremental task. Only the four best-performing methods are shown for clarity.
  • ...and 1 more figures
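
The Figure 2 caption describes frozen backbones with LoRA adapters and an anchoring module that restricts LoRA parameter drift across tasks. The text here does not give the LRA formulation, so the sketch below is only an illustration of the general idea under a stated assumption: a quadratic penalty on drift from the previous task's adapter values, weighted per parameter by a sensitivity estimate (similar in spirit to Elastic Weight Consolidation). The helper names `lora_weight` and `anchoring_penalty`, and the `strength` and `alpha` parameters, are hypothetical.

```python
import numpy as np

def lora_weight(w_frozen, a, b, alpha=1.0):
    # Effective weight: frozen backbone weight plus the low-rank update B @ A.
    return w_frozen + alpha * (b @ a)

def anchoring_penalty(params, anchors, sensitivity, strength=1.0):
    # Assumed form: sensitivity-weighted squared drift from the anchor values
    # saved after the previous task. Not the paper's exact LRA loss.
    return strength * sum(
        float(np.sum(s * (p - p0) ** 2))
        for p, p0, s in zip(params, anchors, sensitivity)
    )

# Toy usage: a rank-2 adapter on a 3x4 frozen weight.
w_frozen = np.ones((3, 4))
a_anchor, b_anchor = np.zeros((2, 4)), np.zeros((3, 2))
a_new = np.full((2, 4), 0.1)      # A drifted by 0.1 in every entry
b_new = b_anchor.copy()           # B unchanged
sens = [np.ones_like(a_new), np.ones_like(b_new)]  # uniform sensitivity

penalty = anchoring_penalty([a_new, b_new], [a_anchor, b_anchor], sens)
w_eff = lora_weight(w_frozen, a_new, b_new)
```

In training, such a penalty would be added to the segmentation loss for the current task, so that parameters with high estimated sensitivity stay close to their previous values while less sensitive ones remain free to adapt.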