AVR: Synergizing Foundation Models for Audio-Visual Humor Detection

Sarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma

TL;DR

This work tackles humor detection without relying on transcripts. It presents AVR, an audio-visual system that uses foundation models to extract multimodal representations and detect humor. AVR combines VideoMAE for visual encoding, AST for audio, and LanguageBind for multimodal binding, producing 768-dimensional features that feed CNN/LSTM classifiers. In a 5-fold cross-validation setup, the CNN with VideoMAE+AST representations achieves the best performance among the tested combinations, which supports audio-visual-only fusion as an alternative to text-based cues. The system ships with a Tkinter-based UI for cross-platform use and FFmpeg-based preprocessing, and it delivers inference in 5–10 seconds on short videos, reducing ASR dependency and making audio-visual humor analysis practical.
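The two numerical details in this summary, 768-dimensional features and 5-fold cross-validation, can be illustrated with a minimal sketch. It assumes that the VideoMAE and AST features are fused by simple concatenation, which the summary does not specify. The function names `fuse` and `k_fold_splits` are hypothetical and are not taken from the AVR code, and the vectors in the usage note are placeholders rather than real model outputs.

```python
import random

FEATURE_DIM = 768  # per-model feature size stated in the summary


def fuse(video_vec, audio_vec):
    """Concatenate one clip's visual and audio features.

    Assumption: fusion by concatenation; the source does not say how
    the VideoMAE and AST representations are combined.
    """
    if len(video_vec) != FEATURE_DIM or len(audio_vec) != FEATURE_DIM:
        raise ValueError("expected 768-dimensional inputs")
    return list(video_vec) + list(audio_vec)


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducible folds
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

For example, `fuse([0.0] * 768, [1.0] * 768)` returns a 1536-element list, and `list(k_fold_splits(23))` produces five train/test pairs in which every sample index appears in exactly one test fold.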

Abstract

In this work, we present AVR, an application for audio-visual humor detection. While humor detection has traditionally centered on textual analysis, recent advances have spotlighted multimodal approaches. However, these methods still use text as one of their modalities, which requires ASR systems to transcribe the audio data. This heavy reliance on ASR accuracy can pose challenges in real-world applications. To address this bottleneck, we propose an audio-visual humor detection system that avoids textual input altogether, eliminating the need for ASR models. Instead, the proposed approach relies on the interplay between audio and visual content for effective humor detection.

Paper Structure

This paper contains 4 sections, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Architecture of the Proposed Modeling Network
  • Figure 2: User Interface