Chumor 2.0: Towards Benchmarking Chinese Humor Understanding

Ruiqi He, Yushu He, Longju Bai, Jiarui Liu, Zhenjie Sun, Zenghao Tang, He Wang, Hanchen Xia, Rada Mihalcea, Naihao Deng

TL;DR

Chumor tackles the gap in Chinese humor understanding by constructing the largest Chinese humor explanation dataset, derived from Ruo Zhi Ba (RZB), and benchmarking ten LLMs on explanation quality. The work demonstrates that even state-of-the-art Chinese-capable LLMs struggle to explain jokes, with best accuracies around 60% and MCC near 0.3, far from human performance (~78% accuracy, MCC 0.60). Notably, chain-of-thought prompting often degrades performance, while human-annotated explanations outperform those from GPT-4o and ERNIE4-turbo in preference tests. The dataset, annotation protocol, and comprehensive analyses provide a benchmark and diagnostic for culturally specific humor understanding, highlighting clear room for progress in non-English humor reasoning. Applications include advancing multilingual humor evaluation and guiding future model development toward culturally aware reasoning.
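The two metrics quoted above, accuracy and MCC (Matthews correlation coefficient), can be sketched as follows. This is a minimal illustration that assumes a binary labeling task with labels 0 and 1; the function names and the toy labels are hypothetical and are not taken from the Chumor codebase or dataset.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    tp, tn, _, _ = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient, in [-1, 1].

    Returns 0.0 when the denominator is zero (for example, when the
    predictions are constant), which is a common convention.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels for illustration only (not Chumor data).
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 1, 1]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"MCC      = {mcc(gold, pred):.3f}")
```

On these toy labels, accuracy is 0.625 while MCC is only about 0.26. MCC is 0 for chance-level or constant predictions and 1 for perfect agreement, which is why an accuracy near 60% can still correspond to a weak correlation with the gold labels.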

Abstract

Existing humor datasets and evaluations predominantly focus on English, leaving limited resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct Chumor, the first Chinese humor explanation dataset that exceeds the size of existing humor datasets. Chumor is sourced from Ruo Zhi Ba, a Chinese Reddit-like platform known for sharing intellectually challenging and culturally specific jokes. We test ten LLMs through direct and chain-of-thought prompting, revealing that Chumor poses significant challenges to existing LLMs, with their accuracy slightly above random and far below human performance. In addition, our analysis highlights that human-annotated humor explanations are significantly better than those generated by GPT-4o and ERNIE-4-turbo. We release Chumor at https://huggingface.co/datasets/dnaihao/Chumor. Our project page is at https://dnaihao.github.io/Chumor-dataset/, our leaderboard is at https://huggingface.co/spaces/dnaihao/Chumor, and our codebase is at https://github.com/dnaihao/Chumor-dataset.

Paper Structure

This paper contains 51 sections, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Test accuracy of different models in the direct prompting (DP) and chain-of-thought (CoT) settings. ERNIE4-turbo and Gemini1.5-pro achieve the highest accuracy of 60.3%.
  • Figure 2: DP accuracy on different joke types (%). We highlight that model performance varies significantly across different joke types.
  • Figure 3: An over-analysis example from GPT-4o. GPT-4o chooses the correct answer under DP prompting but chooses the incorrect answer under CoT prompting because it over-analyzes the joke.
  • Figure 4: Annotated preference for whether the human explanation is preferred ("Human wins") or the explanation from LLMs is preferred ("LLM wins"). Human explanations are significantly preferred over LLM explanations.
  • Figure 5: Distribution of error types for GPT-4o and ERNIE4-turbo. We sample 200 examples to calculate the distribution of these error types. We note that each example may correspond to multiple error types. We highlight that ERNIE4-turbo demonstrates a lower error rate on cultural jokes, while GPT-4o demonstrates a lower error rate on contextual or pun-based jokes.
  • ...and 11 more figures