Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
Pengcheng Zhao, Jinxing Zhou, Yang Zhao, Dan Guo, Yanxiang Chen
TL;DR
This work tackles Audio-Visual Video Parsing (AVVP) by addressing the semantic interference that arises when intra- and cross-modal interactions operate on semantically mixed holistic features. It introduces MM-CSE, a two-stage framework comprising Class-Aware Feature Decoupling (CAFD) and Fine-Grained Semantic Enhancement (FGSE). CAFD disentangles each segment feature into $K$ event-specific features plus a dedicated background feature, guided by reconstruction and orthogonality losses and enhanced by dynamic background fusion. FGSE then models inter-class co-occurrence within a timestamp (SECM) and fuses local segment features with informative global context (LGSF). Both blocks are applied in intra- and cross-modal settings and optimized with a suite of losses, including a novel event co-occurrence loss. On the LLP benchmark, MM-CSE achieves state-of-the-art AVVP performance, yields interpretable decoupled class-wise representations, and reveals meaningful event co-occurrence patterns.
Abstract
The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background feature. The decoupled class-wise features enable our model to selectively aggregate useful semantics for each segment from clearly matched classes contained in other segments, preventing semantic interference from irrelevant classes. We further design a Fine-Grained Semantic Enhancement (FGSE) module for encoding intra- and cross-modal relations. It comprises a Segment-wise Event Co-occurrence Modeling (SECM) block and a Local-Global Semantic Fusion (LGSF) block. The SECM block exploits inter-class dependencies among concurrent events within the same timestamp with the aid of a new event co-occurrence loss. The LGSF block further enhances the event semantics of each segment by incorporating relevant semantics from more informative global video features. Extensive experiments validate the effectiveness of the proposed modules and loss functions, which together yield new state-of-the-art parsing performance.
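To make the decoupling idea concrete, the following is a minimal sketch of two penalties of the kind the TL;DR mentions for CAFD: an orthogonality term that discourages overlap between the class-wise features of one segment, and a reconstruction term that asks those features to sum back to the original segment feature. The exact loss formulations are not given in this abstract, so the squared-cosine and sum-then-MSE forms below, along with the function names `orthogonality_penalty` and `reconstruction_error`, are assumptions chosen for illustration only.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def orthogonality_penalty(class_feats):
    """Mean squared cosine similarity over all pairs of class-wise features.

    class_feats holds the K event-specific vectors plus one background
    vector for a single segment. The value is 0 when all vectors are
    mutually orthogonal. (Assumed form, not the paper's stated loss.)
    """
    total, pairs = 0.0, 0
    for i in range(len(class_feats)):
        for j in range(i + 1, len(class_feats)):
            total += cosine(class_feats[i], class_feats[j]) ** 2
            pairs += 1
    return total / pairs


def reconstruction_error(segment_feat, class_feats):
    """Mean squared error between the segment feature and the sum of its
    class-wise features. (Summation as the recombination rule is assumed.)
    """
    recon = [sum(vals) for vals in zip(*class_feats)]
    return sum((r - s) ** 2 for r, s in zip(recon, segment_feat)) / len(segment_feat)
```

Under these assumptions, a segment whose class-wise features are mutually orthogonal and sum to the holistic feature incurs zero penalty on both terms, while features that all point in the same direction are penalized heavily by the orthogonality term.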
