UR-FUNNY: A Multimodal Language Dataset for Understanding Humor

Md Kamrul Hasan, Wasifur Rahman, Amir Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe Morency, Mohammed (Ehsan) Hoque

TL;DR

This work introduces UR-FUNNY, the first large-scale multimodal dataset for humor detection that combines text, vision, and acoustic cues from TED talks, with explicit punchline and context annotations derived from laughter markers. It formulates the problem as predicting laughter given punchline and context across modalities and proposes the Contextual Memory Fusion Network (C-MFN) to integrate unimodal context, multimodal context, and memory-based fusion. Experiments demonstrate that leveraging all three modalities and contextual information improves humor detection, with punchline information playing a pivotal role and human performance remaining higher than current baselines. UR-FUNNY thus provides a valuable resource and a strong baseline framework for multimodal humor understanding in NLP.

Abstract

Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the use of words (text), gestures (vision) and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, it remains understudied in a multimodal context. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.

Paper Structure

This paper contains 15 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: An example of the UR-FUNNY dataset. UR-FUNNY presents a framework to study the dynamics of humor in multimodal language. Machine learning models are given a sequence of sentences with the accompanying modalities of vision and acoustic. Their goal is to detect whether or not the sequence will trigger immediate laughter by detecting whether or not the last sentence constitutes a punchline.
  • Figure 2: Overview of UR-FUNNY dataset statistics. (a) Distribution of punchline sentence length for humor and non-humor cases. (b) Distribution of context sentence length for humor and non-humor cases. (c) Distribution of the number of sentences in the context. (d) Distribution of the duration (in seconds) of punchline and context sentences. (e) Topics of the videos in the UR-FUNNY dataset. Best viewed zoomed in and in color.
  • Figure 3: The structure of the Unimodal Context Network, as outlined in the paper's section on the Unimodal Context Network. For demonstration purposes, we show the case for $n=2$ (second context sentence). After $n=N_C$, the output $H$ (outlined in blue) is complete. Best viewed in color.
  • Figure 4: The structure of the Multimodal Context Network, as outlined in the paper's section on the Multimodal Context Network. The output $H$ of the Unimodal Context Network is connected to an encoder module to get the multimodal output $\hat{H}$. For details of the components outlined in orange, please refer to the original paper (Vaswani et al., 2017). Best viewed in color.
  • Figure 5: The initialization and recurrence process of the Memory Fusion Network (MFN). The outputs of the Unimodal and Multimodal Context Networks ($H$ and $\hat{H}$) are used to initialize the MFN neural components. For details of the components outlined in orange, please refer to the original paper (Zadeh et al., 2018). Best viewed in color.
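
The task setup described in the Figure 1 caption (a sequence of context sentences followed by a candidate punchline, each with text, vision and acoustic features, and a binary laughter label) can be sketched as a small data structure. This is only an illustrative sketch: the class names `Sentence` and `HumorInstance`, the helper `context_by_modality`, and the per-word feature-vector layout are assumptions made here, not the dataset's actual file format or the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, List

# Modality names taken from the paper's description; the tuple itself is hypothetical.
MODALITIES = ("text", "vision", "acoustic")

# Assumed layout: one feature vector per word, for each modality.
Features = List[List[float]]


@dataclass
class Sentence:
    """One sentence with features for each of the three modalities."""
    text: Features
    vision: Features
    acoustic: Features


@dataclass
class HumorInstance:
    """A context (preceding sentences) plus a candidate punchline and its label."""
    context: List[Sentence]
    punchline: Sentence
    label: int  # 1 if the punchline is followed by laughter, 0 otherwise

    def __post_init__(self) -> None:
        if self.label not in (0, 1):
            raise ValueError("label must be 0 (no laughter) or 1 (laughter)")

    def sequence(self) -> List[Sentence]:
        """Full sentence sequence seen by a model; the punchline comes last."""
        return [*self.context, self.punchline]


def context_by_modality(instance: HumorInstance) -> Dict[str, List[Features]]:
    """Group the N_C context sentences by modality.

    Each list has one entry per context sentence, which is the kind of
    per-modality input a unimodal context encoder would consume.
    """
    return {
        name: [getattr(sentence, name) for sentence in instance.context]
        for name in MODALITIES
    }
```

A model for this task would map a `HumorInstance` (without its label) to a prediction of whether laughter follows. Grouping the context by modality mirrors the separation between unimodal and multimodal context processing in the figure captions, without committing to any particular encoder.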