TEn-CATG: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph

Yaru Chen, Faegheh Sardari, Peiliang Zhang, Ruohao Guo, Yang Xiang, Zhenbo Li, Wenwu Wang

TL;DR

This paper addresses weakly supervised AVVP by introducing TEn-CATG, a text-enriched framework that jointly calibrates semantic alignment and temporal reasoning. The Bi-directional Text Fusion (BiT) module uses text prompts derived from pseudo-labels and semantic anchors from audio-visual features to refine cross-modal embeddings, mitigating label noise. The Category-Aware Temporal Graph (CATG) learns event-specific, multi-scale temporal dependencies by adaptively selecting temporal hops with Gumbel-Softmax and applying a learnable decay to model temporal relevance. Integrated with a gated fusion mechanism and MMIL pooling, the approach achieves state-of-the-art results on LLP and UnAV-100, demonstrating robust cross-modal parsing under weak supervision and improved per-class discriminability. The framework offers practical impact for robust multimodal event localization with reduced reliance on dense frame-level annotations.

Notation: $C$ denotes the number of event categories and $T$ the number of segments in a video.
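To make the role of pooling under weak supervision concrete, the sketch below shows attentive multiple-instance pooling over time for a single modality: $T \times C$ segment-level logits are reduced to $C$ video-level probabilities, so that a video-level label can supervise segment-level predictions. This is a simplified illustration under stated assumptions, not the paper's MMIL implementation (which may also weight across modalities and is combined with gated fusion); the names `attentive_mil_pool`, `seg_logits`, and `att_logits` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attentive_mil_pool(seg_logits, att_logits):
    """Pool T x C segment logits into C video-level probabilities.

    seg_logits[t][c]: event logit for category c at segment t (assumed input)
    att_logits[t][c]: temporal attention logit for the same cell (assumed input)
    """
    T = len(seg_logits)
    C = len(seg_logits[0])
    video_probs = []
    for c in range(C):
        # Attention weights over the T segments for this category.
        weights = softmax([att_logits[t][c] for t in range(T)])
        # Independent per-segment event probabilities (multi-label setting).
        probs = [sigmoid(seg_logits[t][c]) for t in range(T)]
        # The video-level probability is the attention-weighted average.
        video_probs.append(sum(w * p for w, p in zip(weights, probs)))
    return video_probs


# Toy example with T = 3 segments and C = 2 categories.
seg_logits = [[4.0, -4.0], [-4.0, -4.0], [-4.0, -4.0]]
att_logits = [[5.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
video_probs = attentive_mil_pool(seg_logits, att_logits)
```

In this toy input, category 0 is active only in the first segment and the attention concentrates there, so its video-level probability stays high even though most segments are negative; a plain average over time would dilute it.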

Abstract

Audio-visual video parsing (AVVP) aims to detect event categories and their temporal boundaries in videos, typically under weak supervision. Existing methods mainly focus on (i) improving temporal modeling using attention-based architectures or (ii) generating richer pseudo-labels to address the absence of frame-level annotations. However, attention-based models often overfit noisy pseudo-labels, leading to cumulative training errors, while pseudo-label generation approaches distribute attention uniformly across frames, weakening temporal localization accuracy. To address these challenges, we propose TEn-CATG, a text-enriched AVVP framework that combines semantic calibration with category-aware temporal reasoning. More specifically, we design a bi-directional text fusion (BiT) module by leveraging audio-visual features as semantic anchors to refine text embeddings, which departs from conventional text-to-feature alignment, thereby mitigating noise and enhancing cross-modal consistency. Furthermore, we introduce the category-aware temporal graph (CATG) module to model temporal relationships by selecting multi-scale temporal neighbors and learning category-specific temporal decay factors, enabling effective event-dependent temporal reasoning. Extensive experiments demonstrate that TEn-CATG achieves state-of-the-art results across multiple evaluation metrics on benchmark datasets LLP and UnAV-100, highlighting its robustness and superior ability to capture complex temporal and semantic dependencies in weakly supervised AVVP tasks.
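The category-aware temporal graph idea can be illustrated with a small sketch: for one event category, candidate temporal hops are softly selected with a Gumbel-Softmax sample, and each edge is down-weighted by a decay that grows with the hop distance. The exact formulation in the paper is not reproduced here; the exponential decay form `exp(-decay * h)`, the candidate hop set, and the names `gumbel_softmax` and `category_adjacency` are assumptions made for illustration only.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample over candidate hops.

    Adds Gumbel(0, 1) noise to each logit and applies a softmax with
    temperature tau; lower tau gives a sharper (more one-hot) sample.
    """
    noisy = []
    for logit in logits:
        # Clip u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]


def category_adjacency(T, hops, hop_weights, decay):
    """Build a T x T temporal adjacency matrix for a single category.

    Segment i is linked to segments i - h and i + h for each candidate
    hop h > 0. Each edge weight is the hop's selection weight times an
    assumed exponential decay exp(-decay * h).
    """
    adj = [[0.0] * T for _ in range(T)]
    for h, w in zip(hops, hop_weights):
        for i in range(T):
            for j in (i - h, i + h):
                if 0 <= j < T:
                    adj[i][j] += w * math.exp(-decay * h)
    return adj


# Hypothetical setup: three candidate hops, hop logits for one category.
rng = random.Random(0)
hops = [1, 2, 4]
hop_weights = gumbel_softmax([0.5, 0.1, -0.3], tau=0.5, rng=rng)
adj = category_adjacency(T=10, hops=hops, hop_weights=hop_weights, decay=0.3)
```

In a trainable model the hop logits and the decay would be learnable per category, so that short events could favor small hops with fast decay while long events keep distant neighbors; the resulting adjacency would then drive message passing in a graph attention layer.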

Paper Structure

This paper contains 29 sections, 13 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of the AVVP task. An event may occur in one modality, or across different modalities, and at different times.
  • Figure 2: Overview of our model. Segment-level audio/visual features are extracted by frozen CLAP/CLIP, and segment-level text embeddings are generated from pseudo-labels. The BiT module performs feature-aware text calibration via co-attention; the features are then aggregated with HAN and enhanced with CATG, a multi-scale temporal graph (i.e., a residual GAT). Finally, gated fusion and MMIL pooling produce audio ($P_a$), visual ($P_v$), and joint ($P$) predictions.
  • Figure 3: Overview of the BiT module. The modality features and text embeddings interact via bi-directional attention, followed by semantic pooling and fusion through an MLP.
  • Figure 4: Per-class event-level F1 score comparison between CoLeaF (blue) and our model (red).
  • Figure 5: Per-class cosine similarity heatmaps of audio (left) and visual (right) features for CoLeaF (top) and our model (bottom).
  • ...and 2 more figures