SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models

Lee Hyun, Kim Sung-Bin, Seungju Han, Youngjae Yu, Tae-Hyun Oh

TL;DR

This work introduces Video Laugh Reasoning, a task aimed at explaining why laughter occurs in video. It presents SMILE, a new multimodal dataset of 887 clips (TED and sitcoms) paired with human-generated explanations for laughter, focusing on audience laughter to reduce subjectivity. A baseline using large language models with a multimodal textual representation (visual, acoustic, and semantic cues) demonstrates that LLMs can generate plausible, though not yet human-level, reasons for laughter and can scale to other video understanding tasks and in-the-wild content. The study shows the importance of multimodal information and model scale, provides comprehensive evaluation with standard text-generation metrics and human judgments, and offers insights into the modality contributions across video types and tasks. Overall, SMILE and the proposed approach advance socially intelligent AI capable of interpreting nonverbal signals, with implications for dialogue systems, affective computing, and human-robot interaction.

Abstract

Despite recent advances in artificial intelligence, building social intelligence remains a challenge. Among social signals, laughter is one of the distinctive expressions that occur during social interactions between humans. In this work, we tackle a new challenge for machines: understanding the rationale behind laughter in video, which we call Video Laugh Reasoning. We introduce this new task, which is to explain why people laugh in a particular video, together with a dataset for it. Our proposed dataset, SMILE, comprises video clips and language descriptions of why people laugh. We propose a baseline that leverages the reasoning capacity of large language models (LLMs) with a textual video representation. Experiments show that our baseline can generate plausible explanations for laughter. We further investigate the scalability of our baseline by probing other video understanding tasks and in-the-wild videos. We release our dataset, code, and model checkpoints at https://github.com/postech-ami/SMILE-Dataset.

Paper Structure (39 sections, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Why do people laugh? We present Video Laugh Reasoning, a new task to interpret the reasons behind laughter in a video.
  • Figure 2: Video Laugh Reasoning task and multimodal textual representation. Each video clip ($v$) is trimmed into a list of video segments ($s_i$), and each video segment is encoded into a textual representation ($t_i$). The textual video representation consists of visual cues ($V$), acoustic cues from speech ($A$), and semantic cues (the transcript, denoted $T$). We then prompt an LLM to generate why the audience laughs at the given video. The bold text in parentheses in $t$ shows that the LLM is semantically aware of the textual video representation.
  • Figure 3: Which multimodal cue is important for reasoning about laughter? While semantic content is the most influential in causing laughter, the second-ranked modality cues are diverse, suggesting that information from multiple modalities can simultaneously influence laughter.
  • Figure 4: Prompt for laugh reasoning experiments on GPT-3. The prompt is fed into GPT-3 for fine-tuning, zero-shot learning, and in-context learning. For in-context learning, three random samples of prompt-answer pairs from the training set are given to GPT-3. We manually set the video type (sitcom or TED) and video title using the meta information of the video clips. The query stands for the multimodal textual representation $m$ of the video clip. The maximum length of the generated output also varies by video type: 30 words for sitcoms and 40 words for TED talks, reflecting each video type's characteristics.
  • Figure 5: Qualitative results on laugh reasoning. For the examples in (a), GPT-3 fine-tuned on our dataset (denoted FT w/ A+V+T) understands the reasons for laughter by referencing multimodal cues. In contrast, the model fine-tuned on the transcript only (denoted FT w/ T) understands the reasons only partially. In (b), the visual cues (scene description) are crucial for capturing "Joey's sudden appearance," which is important for inferring the reason for laughter.
  • ...and 9 more figures
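
The pipeline described in the captions of Figures 2 and 4 (segments encoded as text from visual, acoustic, and transcript cues, then wrapped in a prompt with a word limit that depends on video type) can be illustrated with a minimal sketch. This is not the authors' code: the `Segment` class, the function names, the bracketed cue tags, and the prompt wording are all hypothetical. Only the three cue types and the 30/40-word limits come from the captions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One video segment s_i, already described in text (hypothetical container)."""
    visual: str      # V: visual cues, e.g. a scene description
    acoustic: str    # A: acoustic cues from speech
    transcript: str  # T: semantic cue (what was said)


# Maximum output lengths stated in the Figure 4 caption.
MAX_WORDS = {"sitcom": 30, "TED": 40}


def segment_to_text(seg: Segment) -> str:
    """Encode one segment as a single line of text (tag format is an assumption)."""
    return (f"[visual] {seg.visual} "
            f"[acoustic] {seg.acoustic} "
            f"[transcript] {seg.transcript}")


def build_prompt(video_type: str, title: str, segments: List[Segment]) -> str:
    """Assemble an illustrative prompt; the wording is NOT the paper's prompt."""
    if video_type not in MAX_WORDS:
        raise ValueError(f"unknown video type: {video_type!r}")
    body = "\n".join(
        f"{i}. {segment_to_text(seg)}" for i, seg in enumerate(segments, start=1)
    )
    return (
        f"Video type: {video_type}\n"
        f"Title: {title}\n"
        f"{body}\n"
        f"Explain why the audience laughed in at most "
        f"{MAX_WORDS[video_type]} words."
    )


if __name__ == "__main__":
    demo = [Segment("a man suddenly appears at the door",
                    "surprised, rising pitch",
                    "What are you doing here?")]
    print(build_prompt("sitcom", "Example episode", demo))
```

The resulting string would then be passed to whichever LLM is used, whether fine-tuned, zero-shot, or with in-context examples prepended. That call is omitted here because it depends on the model and API.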