Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification

Mahrukh Awan, Asmar Nadeem, Muhammad Junaid Awan, Armin Mustafa, Syed Sameed Husain

TL;DR

The Attend-Fusion model effectively combines audio and visual information for video classification, achieving competitive performance with significantly reduced model size, and opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.

Abstract

Exploiting both audio and visual modalities for video classification is challenging: existing methods require large model architectures, leading to high computational complexity and resource requirements, while smaller architectures struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters). Attend-Fusion thus reduces the model size by nearly 80% relative to this larger baseline while giving up only about 0.3 F1 points, highlighting its efficiency in terms of model complexity. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.
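The "nearly 80%" figure can be checked directly from the numbers quoted in the abstract. The snippet below is a minimal arithmetic sketch; the helper name `size_reduction` is hypothetical and not taken from the paper. With the reported parameter counts it gives a reduction of roughly 78.9% and an F1 gap of about 0.32 points.

```python
def size_reduction(small_params: float, large_params: float) -> float:
    """Fraction of parameters saved by the smaller model (hypothetical helper)."""
    return 1.0 - small_params / large_params


# Figures quoted in the abstract.
ATTEND_FUSION_PARAMS = 72e6    # Attend-Fusion
BASELINE_PARAMS = 341e6        # Fully-Connected Late Fusion baseline
ATTEND_FUSION_F1 = 75.64       # percent
BASELINE_F1 = 75.96            # percent

reduction = size_reduction(ATTEND_FUSION_PARAMS, BASELINE_PARAMS)
f1_gap = BASELINE_F1 - ATTEND_FUSION_F1

print(f"Parameter reduction: {reduction:.1%}")
print(f"F1 gap (percentage points): {f1_gap:.2f}")
```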

Paper Structure (27 sections, 12 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Overview of our proposed audio-visual video classification framework on the YouTube-8M dataset, illustrating different fusion mechanisms of audio and visual modalities [bober2017cultivating, ong2018deep, he2016deep, miech2017learnable].
  • Figure 2: Illustration of diverse baseline network architectures for audio-visual (AV) fusion: (a,b) Fully Connected (FC) Audio-only and Video-only networks to assess the impact of individual modalities; (c,d) Fully Connected Neural Networks (FCNNs) with early and late fusion strategies; (e,f) Fully Connected Residual Networks (FCRNs) with early and late fusion; and (g,h) Fully Connected Residual Gated Networks (FCRGNs) with early and late fusion, incorporating gating mechanisms for selective feature attention [bober2017cultivating, ong2018deep, he2016deep, miech2017learnable].
  • Figure 3: Attention-based network architectures: (a) FC Attention Network; (b,c) FC Residual Attention Networks with early and late fusion; (d,e) AV Attention Fusion Networks; (f) Self and Cross Modal Attention Network; (g) Network with Self-Attended Features for Cross Modal Attention.
  • Figure 4: Qualitative results comparing the top-3 predictions of our proposed Attend-Fusion model, the state-of-the-art (SOTA) baseline, and the ground truth labels on representative examples from the YouTube-8M dataset.
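
The early and late fusion strategies named in the Figure 2 caption can be illustrated with a toy sketch. Everything below is an assumption for illustration only: the feature sizes, the single linear layer per branch, and the choice to average per-class scores in late fusion are not taken from the paper, whose networks are deeper and may combine the two branches differently.

```python
import math
import random

# Toy sizes, chosen only for illustration (not the paper's dimensions).
AUDIO_DIM, VIDEO_DIM, NUM_CLASSES = 4, 6, 3


def make_linear(in_dim, out_dim, rng):
    """Return a random weight matrix of shape (out_dim, in_dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(weights, x):
    """Apply a bias-free linear layer to the feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sigmoid(z):
    """Element-wise sigmoid, giving independent per-class scores (multi-label)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def early_fusion(audio, video, w_joint):
    """Concatenate the two feature vectors first, then classify jointly."""
    return sigmoid(linear(w_joint, audio + video))


def late_fusion(audio, video, w_audio, w_video):
    """Classify each modality separately, then average the per-class scores.

    Averaging is an assumed combination rule used here for simplicity.
    """
    audio_scores = sigmoid(linear(w_audio, audio))
    video_scores = sigmoid(linear(w_video, video))
    return [(a + v) / 2.0 for a, v in zip(audio_scores, video_scores)]


rng = random.Random(0)
audio = [rng.random() for _ in range(AUDIO_DIM)]
video = [rng.random() for _ in range(VIDEO_DIM)]

w_joint = make_linear(AUDIO_DIM + VIDEO_DIM, NUM_CLASSES, rng)
w_audio = make_linear(AUDIO_DIM, NUM_CLASSES, rng)
w_video = make_linear(VIDEO_DIM, NUM_CLASSES, rng)

early_scores = early_fusion(audio, video, w_joint)
late_scores = late_fusion(audio, video, w_audio, w_video)
print("early fusion scores:", early_scores)
print("late fusion scores: ", late_scores)
```

The structural difference is where the modalities meet: early fusion lets a single classifier see both feature vectors at once, while late fusion keeps separate per-modality branches and merges only their outputs.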