STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment

Jaewoo Lee; Jaehong Yoon; Wonjae Kim; Yunji Kim; Sung Ju Hwang

TL;DR

STELLA tackles continual audio-video pre-training under task-free conditions by addressing sparse spatio-temporal cross-modal correlations and forgetting of audiovisual relations. It introduces an Audio-Video Matching (AVM) module to compute Localized Patch Importance Scores and a Replay-guided Correlation Assessment to identify patches with strong past-step correlation, guiding probabilistic patch selection. By combining these signals, STELLA achieves stronger zero-shot audiovisual retrieval and robust downstream representations while cutting memory usage by approximately 45% via patch-based rehearsal (and STELLA+ further reduces storage by storing only selected patches). The work provides extensive ablation and modality-gap analyses, validating that selective, correlation-aware patch learning mitigates forgetting and preserves multimodal alignment across sequential tasks.

Abstract

Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. However, this is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs, and multimodal correlation overwriting, which causes the model to forget previously learned audio-video relations. To tackle this problem, we propose a new continual audio-video pre-training method built on two novel ideas. (1) Localized Patch Importance Scoring: we introduce a multimodal encoder to determine an importance score for each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge due to drift, we assess the correlation of the current patches with the past steps to identify the patches that are highly correlated with them. Based on the results of these two ideas, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a relative performance gain of 3.69%p in zero-shot retrieval tasks compared to strong continual learning baselines, while reducing memory consumption by ~45%.
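As a rough illustration of the probabilistic patch selection described above, the following Python sketch samples a subset of patches from two per-patch scores. It is not the authors' implementation: the abstract does not say how the importance and correlation scores are combined, so the linear combination, the softmax, and the names `select_patches`, `past_weight`, and `keep` are all assumptions made for this example. The sign of `past_weight` decides whether patches that correlate strongly with past steps become more or less likely to be kept.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_patches(importance, past_correlation, keep, past_weight=-1.0, seed=0):
    """Sample `keep` distinct patch indices without replacement.

    Hypothetical scoring rule (an assumption, not taken from the paper):
        logit = importance + past_weight * past_correlation
    A negative `past_weight` makes patches that correlate with past steps
    less likely to be selected; a positive one makes them more likely.
    """
    if len(importance) != len(past_correlation):
        raise ValueError("score lists must have the same length")
    if not 0 < keep <= len(importance):
        raise ValueError("keep must be between 1 and the number of patches")

    rng = random.Random(seed)
    logits = [i + past_weight * c for i, c in zip(importance, past_correlation)]
    probs = softmax(logits)

    remaining = list(range(len(probs)))
    chosen = []
    for _ in range(keep):
        # Draw one index in proportion to its probability, then drop it
        # from the pool so that no patch is selected twice.
        weights = [probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)


if __name__ == "__main__":
    importance = [2.0, 0.1, 1.5, 0.3, 0.9, 0.2]
    past_correlation = [0.1, 0.0, 1.8, 0.2, 0.1, 0.0]
    print(select_patches(importance, past_correlation, keep=3))
```

Sampling rather than taking the top-scoring patches keeps some randomness in which patches are used, which is one plausible reading of "probabilistic" selection; a deterministic top-k variant would be a different design.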

Paper Structure (45 sections, 11 equations, 17 figures, 10 tables, 3 algorithms)

Introduction
Related Work
Audiovisual understanding
Multimodal continual learning
Continual Audio-Video Pre-training
Problem Statement
Challenges in Continual Audio-Video Pre-training
Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment
Localized Patch Importance Scoring
Replay-guided Correlation Assessment
Multimodal Patch Selection for Continual Learning
Experiments
Experimental Setup
Evaluation Protocol
Baselines
...and 30 more sections

Figures (17)

Figure 1: Outdated pre-trained audio-video models struggle with understanding emerging new audio-video semantics.
Figure 2: Challenges in continual audio-video learning. (a): A raw data pair describing a car and its engine sound. (b): Sparse correlations in cross-attention maps. (c): After training on a series of tasks after (b), DER++ focuses on entirely different areas (orange circle), presenting correlation forgetting. (d): Our STELLA maintains consistent attention. More examples are given in the supplementary material.
Figure 3: Challenge of multimodal correlation overwriting. Suppose the model has learned to associate human voices with video frame inputs (blue). During continual pre-training, the model can encounter new semantics that share a key visual object (humans), causing it to overwrite the previously learned audio information associated with humans with new audio (i.e., guitar) (red), which results in forgetting.
Figure 4: Overview of our approach. Our method harnesses cross-modal attention maps from the AVM module to compute importance scores in order to identify highly correlated patches (see "Localized Patch Importance Scoring"). Comparing the attention maps created by the current queries with those generated by past queries, we compute correlation scores of the current patches with the past data (see "Replay-guided Correlation Assessment"). Finally, we perform a probabilistic patch selection, combining the importance scores and correlation scores to select patches for continual audio-video pre-training (see "Multimodal Patch Selection for Continual Learning").
Figure 5: Downstream performance for various rehearsal memory sizes. We evaluate the downstream task performance of models pre-trained with various rehearsal memory sizes on Continual-VS.
...and 12 more figures
