Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing

Pengcheng Zhao, Jinxing Zhou, Yang Zhao, Dan Guo, Yanxiang Chen

TL;DR

This work tackles Audio-Visual Video Parsing (AVVP) by addressing the semantic interference that arises when intra- and cross-modal interactions operate on semantically mixed holistic features. It introduces MM-CSE, a two-stage framework comprising Class-Aware Feature Decoupling (CAFD) and Fine-Grained Semantic Enhancement (FGSE). CAFD disentangles each segment feature into $K$ event-specific features plus one background feature, guided by reconstruction and orthogonality losses and enhanced by dynamic background fusion. FGSE then models inter-class co-occurrence within a timestamp (SECM) and fuses local segments with informative global context (LGSF); it is applied in both intra- and cross-modal settings and optimized with a suite of losses, including a novel event co-occurrence loss. On the LLP dataset, MM-CSE achieves state-of-the-art AVVP performance, yields interpretable decoupled class-wise representations, and reveals meaningful event co-occurrence patterns.

Abstract

The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background feature. The decoupled class-wise features enable our model to selectively aggregate useful semantics for each segment from clearly matched classes contained in other segments, preventing semantic interference from irrelevant classes. Specifically, we further design a Fine-Grained Semantic Enhancement module for encoding intra- and cross-modal relations. It comprises a Segment-wise Event Co-occurrence Modeling (SECM) block and a Local-Global Semantic Fusion (LGSF) block. The SECM exploits inter-class dependencies of concurrent events within the same timestamp with the aid of a new event co-occurrence loss. The LGSF further enhances the event semantics of each segment by incorporating relevant semantics from more informative global video features. Extensive experiments validate the effectiveness of the proposed modules and loss functions, resulting in a new state-of-the-art parsing performance.
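To make the decoupling idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one linear projection per class (the hypothetical `class_proj`), a reconstruction loss that asks the class-wise features to sum back to the holistic feature, and an orthogonality loss that penalizes cosine similarity between different classes of the same segment. The exact forms of $\mathcal{L}_\text{rec}$ and $\mathcal{L}_\text{ort}$ are not given in this summary, so both functions are illustrative assumptions.

```python
import numpy as np

def decouple(features, class_proj):
    """Split holistic segment features into class-wise features.

    features:   (T, D) array, one holistic feature per temporal segment.
    class_proj: (C, D, D) array, one hypothetical linear map per class
                (C = K event classes + 1 background class).
    Returns a (T, C, D) array of class-wise features.
    """
    return np.einsum("td,cde->tce", features, class_proj)

def reconstruction_loss(features, class_feats):
    # Assumption: the holistic feature should be recoverable as the
    # sum of its class-wise parts (mean squared error).
    recon = class_feats.sum(axis=1)
    return float(np.mean((recon - features) ** 2))

def orthogonality_loss(class_feats, eps=1e-8):
    # Assumption: features of different classes in the same segment
    # should be dissimilar, so squared off-diagonal cosine similarity
    # is penalized.
    norms = np.linalg.norm(class_feats, axis=-1, keepdims=True)
    unit = class_feats / (norms + eps)
    gram = np.einsum("tcd,ted->tce", unit, unit)  # (T, C, C)
    num_classes = gram.shape[-1]
    off_diag = gram * (1.0 - np.eye(num_classes))
    return float(np.mean(off_diag ** 2))

# Toy usage: 10 segments, 8-dim features, K = 3 events + background.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
w = rng.normal(size=(4, 8, 8))
z = decouple(x, w)
total = reconstruction_loss(x, z) + orthogonality_loss(z)
```

In a trainable model the projections would be learned and the two terms weighted against the parsing loss; the sketch only shows what each term measures.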

Paper Structure

This paper contains 16 sections, 12 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: (a) Illustration of the AVVP task. (b) Previous methods rely on semantically mixed holistic audio/visual features for intra- and cross-modal interactions, leading to semantic interference. (c) In contrast, we utilize decoupled class-aware features to aggregate event semantics for each segment from only matched classes during interactions.
  • Figure 2: Framework Overview. (a) Our MM-CSE network primarily consists of two core modules: the audio-visual Class-Aware Feature Decoupling (AV-CAFD) and the Fine-Grained Semantic Enhancement (AV-FGSE). (b) The CAFD module decouples the encoded audio/visual features into distinct class-wise features, each representing a specific event or the background class. To ensure effective decoupling, we introduce a reconstruction loss $\mathcal{L}_\text{rec}$ and an orthogonality loss $\mathcal{L}_\text{ort}$. (c) The FGSE module further enhances the obtained class-wise features through two successive blocks: the Segment-wise Event Co-occurrence Modeling (SECM) block and the Local-Global Semantic Fusion (LGSF) block. The SECM encodes the relations among concurrent events within each timestamp, whereas the LGSF enhances the event semantics of local temporal segments by fusing relevant semantics of the global video. We also introduce an event co-occurrence loss $\mathcal{L}_\text{ec}$ to steer the learning of event co-occurrence in the SECM block. Notably, the FGSE module is applied to both intra-modality ('intra-FGSE') and cross-modality ('cross-FGSE').
  • Figure 3: Qualitative comparisons. Compared to the previous SOTA method VALOR, our method performs better in identifying multiple overlapping events and recognizing or utilizing the background information.
  • Figure 4: Visualization example of the learned event co-occurrence map.
  • Figure 5: Visualization of our decoupled class-wise features. Each color represents one class.
  • ...and 1 more figure
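The SECM block described in the Figure 2 caption relates concurrent events within each timestamp. One way to picture this is attention computed across the class axis, separately for every segment. The sketch below is an assumption for illustration only: it uses plain scaled dot-product self-attention with a residual connection and no learned weights, and the name `class_cooccurrence_attention` is hypothetical. The event co-occurrence loss $\mathcal{L}_\text{ec}$ is not sketched because its form is not given here.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def class_cooccurrence_attention(class_feats):
    """Attend across classes within each timestamp.

    class_feats: (T, C, D) class-wise features for T segments.
    Returns (enhanced, attn): enhanced is (T, C, D); attn is a
    (T, C, C) map whose row c says how much class c draws on every
    class present at the same timestamp.
    """
    dim = class_feats.shape[-1]
    # Similarity between every pair of classes, per timestamp only.
    scores = np.einsum("tcd,ted->tce", class_feats, class_feats)
    attn = softmax(scores / np.sqrt(dim), axis=-1)
    # Residual update: each class keeps its own feature and adds
    # an attention-weighted mix of the concurrent classes.
    mixed = np.einsum("tce,ted->tcd", attn, class_feats)
    return class_feats + mixed, attn

# Toy usage: 10 segments, 4 classes, 8-dim features.
rng = np.random.default_rng(1)
feats = rng.normal(size=(10, 4, 8))
enhanced, attn = class_cooccurrence_attention(feats)
```

Because the attention is computed independently per timestamp, no semantics leak between segments at this stage; per the caption, temporal context enters later through the LGSF block.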