Audio-visual Event Localization on Portrait Mode Short Videos

Wuyang Liu, Yi Chai, Yongpeng Yan, Yanzhen Ren

TL;DR

This work addresses audio-visual event localization (AVEL) for portrait-mode short videos by introducing AVE-PM, a large-scale dataset of 25,335 ten-second clips across 86 categories. It analyzes cross-mode generalization between landscape- and portrait-mode videos, finding an average performance drop of 18.66% when models are transferred across modes and identifying spatial priors and audio complexity as the key challenges. The authors explore preprocessing strategies (e.g., random cropping, aspect-ratio-aware resizing) and assess the impact of background music on localization, showing that tailored pipelines and model designs can improve portrait-mode AVEL performance. The authors state that the dataset and code will be released to support research on mobile-centric multimodal understanding and domain-adaptive AVEL methods.

Abstract

Audio-visual event localization (AVEL) plays a critical role in multimodal scene understanding. While existing datasets for AVEL predominantly comprise landscape-oriented long videos with clean and simple audio context, short videos have become the primary format of online video content due to the proliferation of smartphones. Short videos are characterized by portrait-oriented framing and layered audio compositions (e.g., overlapping sound effects, voiceovers, and music), which bring unique challenges unaddressed by conventional methods. To this end, we introduce AVE-PM, the first AVEL dataset specifically designed for portrait mode short videos, comprising 25,335 clips that span 86 fine-grained categories with frame-level annotations. Beyond dataset creation, our empirical analysis shows that state-of-the-art AVEL methods suffer an average 18.66% performance drop during cross-mode evaluation. Further analysis reveals two key challenges that arise from the difference in video formats: 1) spatial bias from portrait-oriented framing introduces distinct domain priors, and 2) noisy audio composition compromises the reliability of the audio modality. To address these issues, we investigate optimal preprocessing recipes and the impact of background music for AVEL on portrait mode videos. Experiments show that these methods can still benefit from tailored preprocessing and specialized model design, thus achieving improved performance. This work provides both a foundational benchmark and actionable insights for advancing AVEL research in the era of mobile-centric video content. Dataset and code will be released.

Paper Structure

This paper contains 20 sections, 1 equation, 6 figures, 5 tables.

Figures (6)

  • Figure 1: A glance at AVE-PM, the first audio-visual event dataset on short videos with human-annotated temporal boundaries. It consists of 25,335 10-second videos that span 8 domains and 86 categories. The samples presented here are playing guitar, piercing balloon, using a hammer, train running, auto racing and meowing.
  • Figure 2: Illustrations of statistics on AVE-PM. (a) Distribution of the number of events per category. Categories are grouped by domain, and different colors represent different domains. (b) Distribution of event duration. (c) Distribution of aspect ratios in AVE-PM, where 94.7% of videos are in portrait mode with 9:16 format (width:height).
  • Figure 3: Distribution of categories with the highest and lowest BGM ratios. The top subplot shows the 10 categories with the highest BGM ratio, while the bottom subplot shows the 10 categories with the lowest BGM ratio. The bars represent the count of samples with and without BGM for each category.
  • Figure 4: The accuracy heatmaps of evaluating LAVISH at different spatial locations on the S-PM subset. (a) Accuracy heatmap of LAVISH model trained on S-LM. (b) Accuracy heatmap of LAVISH model trained on S-PM. (c) The difference map represents the subtraction of the accuracy of the model trained on S-LM from the model trained on S-PM, i.e., (b) - (a).
  • Figure 5: The accuracy heatmaps of evaluating LAVISH at different spatial locations on the S-LM subset. (a) Accuracy heatmap of LAVISH model trained on S-LM. (b) Accuracy heatmap of LAVISH model trained on S-PM. (c) The difference map represents the subtraction of the accuracy of the model trained on S-PM from the model trained on S-LM, i.e., (a) - (b).
  • ...and 1 more figures
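
The difference maps in panel (c) of Figures 4 and 5 are described as an element-wise subtraction of two accuracy heatmaps: the in-domain model's accuracy minus the cross-mode model's accuracy at each spatial location. A minimal sketch of that subtraction follows, assuming each heatmap is a rectangular grid of accuracies stored as nested lists. The function name `difference_map` and the example values are hypothetical and are not taken from the paper or its code.

```python
def difference_map(acc_in_domain, acc_cross_mode):
    """Return in-domain accuracy minus cross-mode accuracy, cell by cell.

    Both arguments are grids (lists of rows) of accuracies with the same
    shape, e.g. panel (b) and panel (a) of Figure 4. The grid layout is an
    assumption made for illustration.
    """
    if len(acc_in_domain) != len(acc_cross_mode):
        raise ValueError("heatmaps must have the same number of rows")
    diff = []
    for row_in, row_cross in zip(acc_in_domain, acc_cross_mode):
        if len(row_in) != len(row_cross):
            raise ValueError("heatmap rows must have the same length")
        diff.append([a - b for a, b in zip(row_in, row_cross)])
    return diff


# Hypothetical 2x2 accuracy grids (not results from the paper).
in_domain = [[0.75, 0.5], [0.5, 0.75]]
cross_mode = [[0.5, 0.25], [0.5, 0.25]]
gap = difference_map(in_domain, cross_mode)
```

Under this sign convention, a positive cell marks a spatial location where the model trained on the matching mode is more accurate than the model trained on the other mode.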