Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection

Seohyun Joo, Yoori Oh

TL;DR

A novel dual-pathway audio encoder is proposed. It pairs a semantic pathway for content understanding with a dynamic pathway that captures spectro-temporal dynamics, so that transient acoustic events can be identified from salient spectral bands and rapid energy changes.

Abstract

Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale Mr.HiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.

Paper Structure (16 sections, 5 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Comparison of baseline and proposed framework. The baseline model [jointav] produces uniform scores by relying on global audio-visual features, failing to match the ground-truth. Our proposed framework, however, accurately captures the ground-truth dynamics by utilizing abrupt auditory changes, highlighted by the yellow boxes in the audio, as key features.
  • Figure 2: (a) An overview of the DAViHD framework. The model comprises a Visual Encoder ($E_v$) and a Dual-Pathway Audio Encoder ($E_a^s, E_a^d$). Features from both audio encoders are fused via the Audio Feature Fusion module ($F_a$), and the fused audio feature, $\mathbf{Z}'_a$, is passed to a cross-attention module and an MLP to predict the final score $\hat{y}$. (b) Detailed architecture of the Audio Dynamics Encoder ($E_a^d$). It uses a multi-branch architecture to fuse two attention maps ($\alpha, \beta$), modulated by a saliency gate ($x_s$), with a global average-pooled feature. This information is then used to dynamically control a Frequency-Dynamic convolutional layer.
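
The data flow described in Figure 2(a) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the encoder outputs are replaced by random arrays, and the helper names (`fuse_audio`, `cross_attention`, `mlp_score`), the concatenate-and-project fusion, the single-head attention with visual queries, the residual connection, and all dimensions are hypothetical choices made only to show how the pieces could connect.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_audio(z_sem, z_dyn, w_fuse):
    """Hypothetical stand-in for F_a: concatenate both pathway
    features per time step and project back to d dimensions."""
    return np.concatenate([z_sem, z_dyn], axis=-1) @ w_fuse  # (T, 2d) @ (2d, d)

def cross_attention(z_v, z_a):
    """Single-head scaled dot-product attention (assumption):
    visual features are queries, fused audio features are keys and values."""
    d = z_v.shape[-1]
    attn = softmax(z_v @ z_a.T / np.sqrt(d), axis=-1)  # (T, T), rows sum to 1
    return attn @ z_a, attn

def mlp_score(h, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, then a sigmoid to get one score per time step."""
    hidden = np.maximum(h @ w1 + b1, 0.0)
    logits = (hidden @ w2 + b2)[:, 0]
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
T, d = 8, 16  # hypothetical: 8 time steps, 16-dim features

# Random placeholders for the outputs of E_v, E_a^s and E_a^d.
z_v = rng.standard_normal((T, d))
z_sem = rng.standard_normal((T, d))
z_dyn = rng.standard_normal((T, d))

# Randomly initialised weights (untrained, for shape checking only).
w_fuse = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
w1, b1 = rng.standard_normal((d, d)) / np.sqrt(d), np.zeros(d)
w2, b2 = rng.standard_normal((d, 1)) / np.sqrt(d), np.zeros(1)

z_a_fused = fuse_audio(z_sem, z_dyn, w_fuse)         # plays the role of Z'_a
context, attn = cross_attention(z_v, z_a_fused)      # audio-informed context
y_hat = mlp_score(z_v + context, w1, b1, w2, b2)     # residual add is an assumption

print(z_a_fused.shape, attn.shape, y_hat.shape)
```

With untrained random weights the scores carry no meaning; the sketch only checks that one fused audio feature per time step can condition a per-time-step highlight score $\hat{y}$. The Audio Dynamics Encoder of Figure 2(b) is not modelled here.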