Can Language Models Laugh at YouTube Short-form Videos?

Dayoon Ko, Sangho Lee, Gunhee Kim

TL;DR

This paper addresses the challenge of enabling AI to understand and explain humor in short-form videos. It introduces ExFunTube, a 10,136-video multimodal dataset with timestamps and explanations, built via a GPT-3.5-based filtering pipeline and AMT annotations. A zero-shot video-to-text prompting approach converts video content into structured text that LLMs can reason about to produce humor explanations. Across model-based scores, rationale localization, and human evaluations, the authors show the prompting approach improves humor explanation accuracy, though human-level performance remains out of reach and future work is needed to close the gap.

Abstract

As short-form funny videos on social networks gain popularity, AI models increasingly need to understand them in order to communicate better with humans. Unfortunately, previous video humor datasets target specific domains, such as speeches or sitcoms, and mostly focus on verbal cues. We curate a user-generated dataset of 10K multimodal funny videos from YouTube, called ExFunTube. Using a video filtering pipeline with GPT-3.5, we verify that both verbal and visual elements contribute to the humor. After filtering, we annotate each video with timestamps and text explanations for funny moments. ExFunTube differs from existing datasets in that its videos cover a wide range of domains with various types of humor that require a multimodal understanding of the content. We also develop a zero-shot video-to-text prompting method to maximize the video humor understanding of large language models (LLMs). With three different evaluation methods (automatic scores, rationale quality experiments, and human evaluations), we show that our prompting significantly improves LLMs' ability to explain humor.

Paper Structure (20 sections, 1 equation, 18 figures, 4 tables)

Figures (18)

  • Figure 1: An example from the ExFunTube dataset. We curate funny short-form videos in various domains through a filtering pipeline that verifies both verbal and visual elements contributing to humor. Each video is annotated with timestamps and explanations for funny moments. In this example, three funny moments are identified.
  • Figure 2: The video filtering pipeline selects multimodal funny videos. Red boxes display the actual prompts provided to GPT-3.5. See § \ref{sec:filtering} for details. (a) We generate a transcript and a caption from the input video. (b) Via GPT-3.5 prompting, we filter out the video if it is not funny given the transcript and caption. (c) The video is accepted if it is funny given both the transcript and caption but not given the transcript alone, since its humor is then multimodal. (d) GPT-3.5 generates humor explanations with and without the video caption. We remove the video if the two explanations are too similar, since its humor is then not multimodal. Examples for each case are presented in the Appendix.
  • Figure 3: (a) Zero-shot video-to-text prompting for converting video content into fine-grained text (§ \ref{sec:v2t}). For the visual modality, the video is first divided into $N$ segments; for each segment, many candidate captions are generated and the best one is chosen. For the audio modality, a speaker-separated transcript and sound tags are obtained. (b) The fine-grained text is configured as an input prompt to LLMs (§ \ref{sec:prompt}).
  • Figure 4: Results of human preference: comparing GPT-3.5 with our prompting to text-only GPT-3.5, MAF, and Gold, respectively.
  • Figure 5: Explanation performance according to humor taxonomy. We categorize all videos into 20 humor classes and compare the performance of eight different baselines in terms of the SentBERT score. The humor taxonomy is arranged in descending order of proportion in our dataset.
  • ...and 13 more figures
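
The accept/reject logic described in the Figure 2 caption, steps (b) to (d), can be sketched as a small decision function. This is only an illustrative reading of the caption, not the authors' implementation: `Video`, `keep_video`, the `is_funny` and `explain` callables (stand-ins for the GPT-3.5 prompts), the token-overlap `jaccard` similarity, and the `max_similarity` threshold of 0.8 are all hypothetical names and values introduced here. The sketch also assumes that step (d) applies only to videos judged funny from the transcript alone, which the caption implies but does not state outright.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Video:
    transcript: str  # speech transcript from step (a)
    caption: str     # visual caption from step (a)


# Hypothetical signatures standing in for the GPT-3.5 prompts:
# each takes a transcript and an optional caption.
FunnyJudge = Callable[[str, Optional[str]], bool]
Explainer = Callable[[str, Optional[str]], str]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a stand-in, not the paper's similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def keep_video(
    video: Video,
    is_funny: FunnyJudge,
    explain: Explainer,
    similarity: Callable[[str, str], float] = jaccard,
    max_similarity: float = 0.8,  # assumed threshold, for illustration only
) -> bool:
    """Return True if the video is kept as multimodally funny."""
    # (b) Reject videos that are not funny even with transcript and caption.
    if not is_funny(video.transcript, video.caption):
        return False
    # (c) Funny with the caption but not from the transcript alone: accept.
    if not is_funny(video.transcript, None):
        return True
    # (d) Funny either way: compare explanations with and without the caption
    # and reject when they are too similar (the visuals add nothing).
    with_caption = explain(video.transcript, video.caption)
    without_caption = explain(video.transcript, None)
    return similarity(with_caption, without_caption) <= max_similarity
```

Passing the judge and explainer in as callables keeps the sketch independent of any particular LLM client; in practice each would wrap a prompt like those shown in the red boxes of Figure 2.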