TEn-CATG: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
Yaru Chen, Faegheh Sardari, Peiliang Zhang, Ruohao Guo, Yang Xiang, Zhenbo Li, Wenwu Wang
TL;DR
This paper addresses weakly supervised audio-visual video parsing (AVVP) by introducing TEn-CATG, a text-enriched framework that jointly calibrates semantic alignment and temporal reasoning. The Bi-directional Text Fusion (BiT) module uses text prompts derived from pseudo labels and semantic anchors from audio-visual features to refine cross-modal embeddings, mitigating label noise. The Category-Aware Temporal Graph (CATG) learns event-specific, multi-scale temporal dependencies by adaptively selecting temporal hops with Gumbel-Softmax and applying a learnable decay to model temporal relevance. Integrated with a gated fusion mechanism and MMIL pooling, the approach achieves state-of-the-art results on LLP and UnAV-100, demonstrating robust cross-modal parsing under weak supervision and improved per-class discriminability. The framework offers practical impact for robust multimodal event localization with reduced reliance on dense frame-level annotations. Throughout, $C$ denotes the number of event categories and $T$ the number of segments in a video.
Abstract
Audio-visual video parsing (AVVP) aims to detect event categories and their temporal boundaries in videos, typically under weak supervision. Existing methods mainly focus on (i) improving temporal modeling using attention-based architectures or (ii) generating richer pseudo-labels to address the absence of frame-level annotations. However, attention-based models often overfit noisy pseudo-labels, leading to cumulative training errors, while pseudo-label generation approaches distribute attention uniformly across frames, weakening temporal localization accuracy. To address these challenges, we propose TEn-CATG, a text-enriched AVVP framework that combines semantic calibration with category-aware temporal reasoning. More specifically, we design a bi-directional text fusion (BiT) module by leveraging audio-visual features as semantic anchors to refine text embeddings, which departs from conventional text-to-feature alignment, thereby mitigating noise and enhancing cross-modal consistency. Furthermore, we introduce the category-aware temporal graph (CATG) module to model temporal relationships by selecting multi-scale temporal neighbors and learning category-specific temporal decay factors, enabling effective event-dependent temporal reasoning. Extensive experiments demonstrate that TEn-CATG achieves state-of-the-art results across multiple evaluation metrics on benchmark datasets LLP and UnAV-100, highlighting its robustness and superior ability to capture complex temporal and semantic dependencies in weakly supervised AVVP tasks.
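To make the CATG idea concrete, the following is a minimal sketch of how a per-category temporal adjacency could be built from a Gumbel-Softmax hop selection and an exponential decay over segment distance. It is an illustration under stated assumptions, not the authors' implementation: the candidate hop set, the soft reachability mask, the exponential form of the decay, and the names `gumbel_softmax` and `category_adjacency` are all hypothetical choices made for this example. In a trained model the hop logits and the decay rate would be learnable per category; here they are fixed numbers.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed (soft) one-hot sample over len(logits) choices."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-10), 1.0 - 1e-10)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def category_adjacency(T, hop_probs, hops, decay):
    """Row-normalised T x T temporal adjacency for a single category.

    Assumed form (hypothetical): the edge between segments i and j is the
    probability mass of hops that reach distance |i - j|, multiplied by
    exp(-decay * |i - j|).
    """
    adj = [[0.0] * T for _ in range(T)]
    for i in range(T):
        for j in range(T):
            d = abs(i - j)
            reach = sum(p for p, h in zip(hop_probs, hops) if d <= h)
            adj[i][j] = reach * math.exp(-decay * d)
        row_sum = sum(adj[i])  # at least 1.0, since the diagonal is 1.0
        adj[i] = [w / row_sum for w in adj[i]]
    return adj


if __name__ == "__main__":
    rng = random.Random(0)
    hops = [1, 2, 4]  # hypothetical candidate temporal hops
    probs = gumbel_softmax([0.5, 1.0, 0.2], tau=0.5, rng=rng)
    adj = category_adjacency(T=10, hop_probs=probs, hops=hops, decay=0.3)
    print([round(w, 3) for w in adj[0]])
```

Repeating this for each of the $C$ categories would give $C$ separate $T \times T$ graphs, so that a category with short events could favour small hops and fast decay while a category with long events could favour the opposite. Lowering `tau` pushes the hop selection closer to a hard one-hot choice.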
