FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild

Zhi-Song Liu, Robin Courant, Vicky Kalogeiton

TL;DR

This work tackles the problem of automatically identifying funny moments in videos with a multimodal framework that integrates visual, audio, and text information. FunnyNet-W uses a three-encoder architecture with a Cross-Attention Fusion module to learn cross-modal correlations. It is trained with a self-supervised contrastive loss and a binary funny/not-funny classification loss, while an unsupervised laughter detector provides the training labels without manual annotation. The approach achieves state-of-the-art results on five datasets, both when ground-truth text is available and in the wild with automatically generated text, and it generalizes well to new domains. The study also analyzes the relative contribution of each modality, the fusion mechanism, and the impact of audio quality, offers practical insights for deploying multimodal humor detection systems, and discusses ethical and environmental considerations.

Abstract

Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as body language, dialogues and culture. In this paper, we propose FunnyNet-W, a model that relies on cross- and self-attention for visual, audio and text data to predict funny moments in videos. Unlike most methods that rely on ground-truth data in the form of subtitles, in this work we exploit modalities that come naturally with videos: (a) video frames, as they contain visual information indispensable for scene understanding, (b) audio, as it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses, and (c) text automatically extracted with a speech-to-text model, as it can provide rich information when processed by a Large Language Model. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD, Friends, and the TED talk UR-Funny. Extensive experiments and analysis show that FunnyNet-W successfully exploits visual, auditory and textual cues to identify funny moments, while our findings reveal FunnyNet-W's ability to predict funny moments in the wild. FunnyNet-W sets the new state of the art for funny moment detection with multimodal cues on all datasets, with and without using ground-truth information.
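
To make the "cross- and self-attention" idea above more concrete, here is a minimal NumPy sketch of one way three sets of modality tokens could be fused. It is an illustration under stated assumptions, not the authors' implementation: the function names (`attention`, `fuse`), the choice to let each modality attend to the other two, the absence of learned projections, and the mean pooling at the end are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key_value):
    """Single-head scaled dot-product attention without learned projections.

    query:     (n_q, d) tokens that ask the question
    key_value: (n_kv, d) tokens that are attended to
    Returns the attended tokens (n_q, d) and the weights (n_q, n_kv).
    """
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ key_value, weights

def fuse(audio, visual, text):
    """Hypothetical fusion: cross-attention, then self-attention, then pooling."""
    tokens = {"audio": audio, "visual": visual, "text": text}
    crossed = []
    for name, query in tokens.items():
        # Assumption: each modality attends to the tokens of the other two.
        others = np.concatenate(
            [v for k, v in tokens.items() if k != name], axis=0
        )
        attended, _ = attention(query, others)
        crossed.append(attended)
    stacked = np.concatenate(crossed, axis=0)
    # Self-attention over all cross-attended tokens.
    fused, _ = attention(stacked, stacked)
    # Mean pooling to a single clip-level vector (an assumption).
    return fused.mean(axis=0)

# Random stand-ins for encoder outputs: 4 audio, 6 visual, 5 text tokens, dim 8.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
visual = rng.normal(size=(6, 8))
text = rng.normal(size=(5, 8))

clip_embedding = fuse(audio, visual, text)
print(clip_embedding.shape)
```

In a trained model the queries, keys, and values would pass through learned linear projections, and the pooled vector would feed a funny/not-funny classifier; both are omitted here to keep the sketch short.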

Paper Structure

This paper contains 51 sections, 3 equations, 14 figures, and 12 tables.

Figures (14)

  • Figure 1: What is funny? Audio cues along with visual frames and textual data are a rich source of information for identifying funny moments in videos. Video scene from Pulp Fiction, 1994, source video https://www.youtube.com/watch?v=4L5LjjYVsHQ
  • Figure 2: Architecture of FunnyNet-W. Given audio-visual clips, FunnyNet-W predicts funny moments in videos. It consists of the audio (blue), textual (red), and visual (green) encoders, whose outputs pass through the Cross Attention Fusion (CAF), which consists of cross-attention (CA) and self-attention (SA) for feature fusion. It is trained to embed all modalities in the same space via self-supervision ($L_{\text{ss}}$) and to classify clips as funny or not-funny ($L_{\text{cls}}$).
  • Figure 3: Proposed laughter detector. It takes raw waveforms as input and consists of (i) removing voices by subtracting channels (here, the audio is stereo with 2 channels), (ii) detecting peaks, and (iii) clustering the detected audio segments into music and laughter.
  • Figure 4: Comparison of various time window lengths used as input to the (top) visual encoder of FunnyNet-W (referred to as FunnyNet-W V) and (bottom) audio encoder of FunnyNet-W (referred to as FunnyNet-W A). We illustrate (left, a) the F1 score and (right, b) the accuracy on different datasets. The average results are plotted as red lines.
  • Figure 5: Comparison of different lengths of time windows for the visual encoder of FunnyNet-W (referred to as FunnyNet-W V). We illustrate (a) the F1 score and (b) the accuracy on different datasets. The average results are plotted in red points and lines for Timesformer and magenta for VideoMAE.
  • ...and 9 more figures
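
The three steps in the Figure 3 caption (channel subtraction, peak detection, clustering) can be sketched on a synthetic stereo signal as follows. This is a toy illustration and not the paper's detector: the frame length, the energy threshold, the use of zero-crossing rate as the clustering feature, the 2-means clustering, and all function names are assumptions chosen for the example.

```python
import math
import random

def remove_voice(left, right):
    """Step (i): subtract the channels; content identical in both cancels."""
    return [l - r for l, r in zip(left, right)]

def frame_energy(signal, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]

def detect_peaks(energy, threshold):
    """Step (ii): return (start, end) frame ranges whose energy exceeds threshold."""
    segments, start = [], None
    for i, e in enumerate(energy):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energy)))
    return segments

def zero_crossing_rate(segment):
    """Fraction of consecutive sample pairs that change sign (assumed feature)."""
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if a * b < 0)
    return crossings / len(segment)

def two_means(values, iters=20):
    """Step (iii): 1-D k-means with k=2; label 1 is the higher-valued cluster."""
    c0, c1 = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, l in zip(values, labels) if l == 0]
        g1 = [v for v, l in zip(values, labels) if l == 1]
        c0 = sum(g0) / len(g0) if g0 else c0
        c1 = sum(g1) / len(g1) if g1 else c1
    return labels

# Synthetic 4-second stereo clip at 8 kHz (all values are illustrative).
sr, frame_len = 8000, 400
noise = random.Random(0)
voice = [0.5 * math.sin(2 * math.pi * 200 * i / sr) for i in range(4 * sr)]
side = [0.0] * (4 * sr)
for i in range(4000, 8000):      # tonal burst, a stand-in for music
    side[i] = 0.4 * math.sin(2 * math.pi * 220 * i / sr)
for i in range(16000, 20000):    # noisy burst, a stand-in for laughter
    side[i] = noise.uniform(-0.4, 0.4)
for i in range(24000, 28000):    # second tonal burst
    side[i] = 0.4 * math.sin(2 * math.pi * 220 * i / sr)
left = [v + s for v, s in zip(voice, side)]   # voice is in both channels
right = voice

residual = remove_voice(left, right)
segments = detect_peaks(frame_energy(residual, frame_len), threshold=0.01)
features = [
    zero_crossing_rate(residual[a * frame_len:b * frame_len]) for a, b in segments
]
labels = two_means(features)
print(segments, labels)
```

In this toy setup the voice is identical in both channels and therefore cancels in step (i), leaving only the bursts. Real recordings would need a richer feature than zero-crossing rate to separate laughter from music; the feature is only a placeholder to make the clustering step runnable.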