Audio-visual Event Localization on Portrait Mode Short Videos
Wuyang Liu, Yi Chai, Yongpeng Yan, Yanzhen Ren
TL;DR
This work addresses audio-visual event localization (AVEL) for portrait-mode short videos by introducing AVE-PM, a large-scale dataset of 25,335 10-second clips spanning 86 categories. It analyzes cross-mode generalization between landscape- and portrait-mode videos, finding an average performance drop of 18.66% when models are transferred across modes and identifying spatial priors and audio complexity as the key challenges. The authors explore preprocessing strategies (e.g., random cropping, aspect-ratio-aware resizing) and assess the impact of background music on localization, showing that tailored pipelines and model designs can improve portrait-mode AVEL performance. The authors state that the dataset and code will be released to support research on mobile-centric multimodal understanding and domain-adaptive AVEL methods.
Abstract
Audio-visual event localization (AVEL) plays a critical role in multimodal scene understanding. While existing datasets for AVEL predominantly comprise landscape-oriented long videos with clean and simple audio context, short videos have become the primary format of online video content due to the proliferation of smartphones. Short videos are characterized by portrait-oriented framing and layered audio compositions (e.g., overlapping sound effects, voiceovers, and music), which bring unique challenges unaddressed by conventional methods. To this end, we introduce AVE-PM, the first AVEL dataset specifically designed for portrait mode short videos, comprising 25,335 clips that span 86 fine-grained categories with frame-level annotations. Beyond dataset creation, our empirical analysis shows that state-of-the-art AVEL methods suffer an average 18.66% performance drop during cross-mode evaluation. Further analysis reveals two key challenges that arise from the differing video formats: 1) spatial bias from portrait-oriented framing introduces distinct domain priors, and 2) noisy audio composition compromises the reliability of the audio modality. To address these issues, we investigate optimal preprocessing recipes and the impact of background music for AVEL on portrait mode videos. Experiments show that existing AVEL methods can still benefit from tailored preprocessing and specialized model design, achieving improved performance. This work provides both a foundational benchmark and actionable insights for advancing AVEL research in the era of mobile-centric video content. Dataset and code will be released.
