PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

Xiao Yu, Yan Fang, Xiaojie Jin, Yao Zhao, Yunchao Wei

TL;DR

This work tackles real-time online audio-visual event parsing (On-AVEP) by introducing the Predictive Future Modeling (PreFM) framework. PreFM combines three components: predictive multimodal future modeling (PMFM), which infers useful future cues; modality-agnostic robust representation (MRR), learned via knowledge distillation; and focal temporal prioritization (FTP), which emphasizes the current time step to improve online inference under limited context. On UnAV-100 and LLP, PreFM achieves state-of-the-art performance with far fewer parameters and lower computational cost, running at up to 51.9 FPS with low latency, which makes it practical for resource-constrained, real-time multimodal understanding. Together, PMFM, MRR, and FTP enable precise online parsing of audio, visual, and audio-visual events.

Abstract

Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with very large models, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task requires models with two key capabilities: (1) accurate online inference, to distinguish events given unclear and limited context in online settings, and (2) real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework, featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show that PreFM outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.
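
To make the online setting concrete, the following is a minimal sketch of causal, step-by-step inference over a stream, assuming a fixed-length context window. It is not the PreFM implementation: `parse_online`, `toy_model`, and `context_len` are hypothetical names introduced here for illustration, and the toy model merely averages features. The point it illustrates is only that, at each step, the model sees past and current features and never later ones.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence, Tuple

Feature = Sequence[float]  # features for one time step of one modality
Scores = List[float]       # per-class event scores for the current step


def parse_online(
    stream: Iterable[Tuple[Feature, Feature]],
    model: Callable[[List[Feature], List[Feature]], Scores],
    context_len: int = 10,
) -> Iterator[Scores]:
    """Yield scores for each incoming step using only past and current features."""
    # Bounded buffers: the oldest step is dropped once context_len is reached.
    audio_ctx: Deque[Feature] = deque(maxlen=context_len)
    visual_ctx: Deque[Feature] = deque(maxlen=context_len)
    for audio_feat, visual_feat in stream:
        audio_ctx.append(audio_feat)
        visual_ctx.append(visual_feat)
        # No look-ahead: the model is called before any later step arrives.
        yield model(list(audio_ctx), list(visual_ctx))


def toy_model(audio_ctx: List[Feature], visual_ctx: List[Feature]) -> Scores:
    """Hypothetical stand-in: mean of each modality's first feature over the window."""
    a = sum(f[0] for f in audio_ctx) / len(audio_ctx)
    v = sum(f[0] for f in visual_ctx) / len(visual_ctx)
    return [a, v]


# Five synthetic steps of (audio, visual) features.
stream = [([float(t)], [float(2 * t)]) for t in range(5)]
outputs = list(parse_online(stream, toy_model, context_len=2))
```

Because the buffers are bounded, memory and per-step compute stay constant regardless of stream length, which is the property the real-time efficiency requirement depends on.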

Paper Structure

This paper contains 54 sections, 10 equations, 8 figures, 17 tables.

Figures (8)

  • Figure 1: (a) Illustration of parsing events in online scenarios: if a man opens his mouth and produces a vocal sound at time $T$, it is unclear based solely on information from $0$ to $T$ whether this marks the beginning of a musical phrase (as part of "singing") or the start of a conversation (as in "speaking"). Precisely parsing events with such limited context is crucial for accurate online inference. (b) Simplified architecture of our PreFM framework, highlighting predictive future modeling and modality-agnostic robust representation (MRR). (c) Comparison of performance and efficiency against SOTA methods on the UnAV-100 dataset (Geng et al., 2023).
  • Figure 2: The pipeline of PreFM. It takes real-time audio-visual streams, using predictive modeling to generate multimodal future context, modality-agnostic robust representation to enhance performance by transferring knowledge, and focal temporal prioritization to emphasize the current time step $T$.
  • Figure 3: Temporal-modality cross fusion for the pseudo-future $\hat{F}_f^v$.
  • Figure 4: (a) The performance across different relative time steps. (b) t-SNE visualization of the pre-classifier features. We use nine animal events from UnAV-100 (Geng et al., 2023) for better illustration.
  • Figure 5: (a) The visualization on the On-AVEL task. (b) The visualization on the On-AVVP task. GT: ground truth. The red dotted box indicates the area of mispredictions.
  • ...and 3 more figures