Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation
Kexin Li, Zongxin Yang, Yi Yang, Jun Xiao
TL;DR
This paper tackles temporal misalignment in audio-visual video segmentation (AVVS) by introducing the Collaborative Hybrid Propagator (Co-Prop), a two-stage framework that first anchors audio boundaries and then propagates segmentation frame by frame with audio guidance. The Retrieval-Augmented Control Points Generation (RCPG) module uses retrieval-augmented prompts and a large language model (Qwen) to identify control points that mark transitions between sound sources, so that the audio can be split into semantically consistent subclips. The Audio-Inserted Propagator (AIP) uses a Keyframe Processor to generate keyframe masks, then propagates them to the remaining frames while injecting frame-wise audio information via cross-attention, which keeps audio and masks aligned per frame and reduces memory usage. Experiments on three AVVS benchmarks (M3, S4, AVSS) with two backbones show improved alignment rates and pixel-edge fidelity, and the method can be plugged into existing AVVS approaches to improve them. The work also introduces a compact dataset and an Alignment Rate metric for assessing temporal coherence, supporting practical AVVS deployment in applications such as AR/VR and surveillance.
Abstract
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects that accurately align with the corresponding audio. However, existing methods often suffer from temporal misalignment, where audio cues and segmentation results are not temporally coordinated. Audio provides two critical pieces of information: i) target object-level details and ii) the timing of when objects start and stop producing sounds. Current methods focus on object-level information but neglect the boundaries of audio semantic changes, which leads to temporal misalignment. To address this issue, we propose a Collaborative Hybrid Propagator Framework (Co-Prop). This framework includes two main steps: Preliminary Audio Boundary Anchoring and Frame-by-Frame Audio-Insert Propagation. To anchor the audio boundaries, we employ retrieval-assisted prompts with the Qwen large language model to identify control points of audio semantic changes. These control points split the audio into semantically consistent portions. After obtaining the list of control points, we propose the Audio Insertion Propagator, which processes each audio portion with a frame-by-frame audio insertion, propagation, and matching approach. We curate a compact dataset comprising diverse sound-source conversion cases and devise a metric to assess alignment rates. Compared to traditional methods that process all frames simultaneously, our approach reduces memory requirements and facilitates frame alignment. Experimental results demonstrate the effectiveness of our approach across three datasets and two backbones. Furthermore, our method can be integrated with existing AVVS approaches, offering plug-and-play functionality to enhance their performance.
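As a rough illustration of how control points can partition a clip into semantically consistent portions, the sketch below splits a range of frame indices at a given list of control points. It is not the paper's implementation: the function name `split_by_control_points`, the use of integer frame indices as control points, and the half-open segment convention are all assumptions made for this example, and the step that produces the control points (retrieval-assisted prompting of Qwen) is not shown.

```python
from typing import List, Tuple


def split_by_control_points(
    num_frames: int, control_points: List[int]
) -> List[Tuple[int, int]]:
    """Split frames [0, num_frames) into half-open segments at control points.

    Hypothetical helper: each control point is assumed to be the index of the
    first frame after a change of sound source.
    """
    # Keep only points strictly inside the clip, deduplicated and sorted.
    cuts = sorted({p for p in control_points if 0 < p < num_frames})
    bounds = [0] + cuts + [num_frames]
    # Pair consecutive boundaries into (start, end) segments.
    return list(zip(bounds[:-1], bounds[1:]))


# A 10-frame clip whose sound source changes at frames 4 and 7.
segments = split_by_control_points(10, [4, 7])
print(segments)  # [(0, 4), (4, 7), (7, 10)]
```

Under these assumptions, each resulting segment could then be handed to a per-segment propagation step independently, which is one way to read the frame-by-frame processing described above.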
