UR-FUNNY: A Multimodal Language Dataset for Understanding Humor
Md Kamrul Hasan, Wasifur Rahman, Amir Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe Morency, Mohammed Hoque
TL;DR
This work introduces UR-FUNNY, the first large-scale multimodal dataset for humor detection that combines text, vision, and acoustic cues from TED talks, with explicit punchline and context annotations derived from laughter markers. It formulates the problem as predicting laughter given punchline and context across modalities and proposes the Contextual Memory Fusion Network (C-MFN) to integrate unimodal context, multimodal context, and memory-based fusion. Experiments demonstrate that leveraging all three modalities and contextual information improves humor detection, with punchline information playing a pivotal role and human performance remaining higher than current baselines. UR-FUNNY thus provides a valuable resource and a strong baseline framework for multimodal humor understanding in NLP.
Abstract
Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the use of words (text), gestures (vision), and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, it remains understudied in a multimodal context. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.
