Efficient Audio-Visual Fusion for Video Classification

Mahrukh Awan, Asmar Nadeem, Armin Mustafa

TL;DR

Extensive experiments on the YouTube-8M dataset show that Attend-Fusion achieves competitive performance with significantly reduced model complexity compared to larger baseline models.

Abstract

We present Attend-Fusion, a novel and efficient approach for audio-visual fusion in video classification tasks. Our method addresses the challenge of exploiting both audio and visual modalities while maintaining a compact model architecture. Through extensive experiments on the YouTube-8M dataset, we demonstrate that Attend-Fusion achieves competitive performance with significantly reduced model complexity compared to larger baseline models.

Paper Structure

This paper contains 2 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Overview of our proposed audio-visual video classification framework on the YouTube-8M dataset [abu2016youtube], illustrating different fusion mechanisms for the audio and visual modalities.
  • Figure 2: Comparison of Fully-Connected (FC) Late Fusion (baseline) and Attend-Fusion architectures.
  • Figure 3: Qualitative results comparing the top-3 predictions of Attend-Fusion, FC Late Fusion (SOTA baseline), and the ground truth labels on representative examples from the YouTube-8M dataset.