Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba

Ruiqi He, Yushu He, Longju Bai, Jiarui Liu, Zhenjie Sun, Zenghao Tang, He Wang, Hanchen Xia, Naihao Deng

TL;DR

This work constructs Chumor, a dataset sourced from Ruo Zhi Ba, a Chinese Reddit-like platform dedicated to sharing intellectually challenging and culturally specific jokes. The authors annotate an explanation for each joke and evaluate the human explanations against those of two state-of-the-art LLMs.

Abstract

Existing humor datasets and evaluations focus predominantly on English, leaving few resources for culturally nuanced humor in non-English languages such as Chinese. To address this gap, we construct Chumor, a dataset sourced from Ruo Zhi Ba (RZB), a Chinese Reddit-like platform dedicated to sharing intellectually challenging and culturally specific jokes. We annotate explanations for each joke and evaluate the human explanations against those of two state-of-the-art LLMs, GPT-4o and ERNIE Bot, through A/B testing by native Chinese speakers. Our evaluation shows that Chumor is challenging even for SOTA LLMs, and that the human explanations for Chumor jokes are significantly better than the explanations generated by the LLMs.

Paper Structure (35 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Annotated preference for whether the human explanation is better ("Human wins") or the LLM explanation is better ("LLM wins").
  • Figure 2: An example of a Chinese joke from Ruo Zhi Ba (RZB, "弱智吧") for which the human explanation and the ChatGPT-4o explanation differ (as of June 5th, 2024). Interestingly, when given both explanations, ChatGPT-4o judges that the human explanation explains the humor better than its own. This agrees with the human raters, who also judge that the human explanation better explains the humor. Whether LLMs can serve as preference annotators is discussed further in \ref{app-sec: llms-as-preference-annotator}.
  • Figure 3: Distribution of error types for GPT-4o and ERNIE Bot. We sample 200 examples to calculate the distribution of these error types. We note that an example may correspond to multiple error types.
  • Figure 4: Preference annotation from GPT-4o. We prompt GPT-4o to choose the better explanation between its own and the one written by a human. We note that GPT-4o's preference differs significantly from the human preference in \ref{fig:gpt-4o-preference-eval}.