Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding

Tuo Zhang, Tiantian Feng, Yibin Ni, Mengqin Cao, Ruying Liu, Katharine Butler, Yanjun Weng, Mi Zhang, Shrikanth S. Narayanan, Salman Avestimehr

TL;DR

This work introduces the Pun Rebus Art Dataset, a large-scale, bilingual benchmark of Chinese pun rebus art designed to test whether vision-language models (VLMs) can identify visual cues, map them to culturally specific symbols, and generate coherent explanations. It shows that state-of-the-art VLMs fall short in spotting salient visual elements, reasoning about symbols, and producing unbiased explanations; few-shot prompting yields only modest gains, whereas targeted fine-tuning yields substantial improvements. The study provides a detailed evaluation protocol and shares its data and prompts to support comparable benchmarking, highlighting the need to integrate cross-cultural knowledge into multimodal models. By exposing these limitations, the paper argues for more diverse training data and culturally informed evaluation, so that AI systems can understand heritage-rich content beyond English-language corpora.

Abstract

Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for art understanding deeply rooted in traditional Chinese culture. We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explaining the conveyed messages. Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations and showing limited improvement through in-context learning. By releasing the Pun Rebus Art Dataset, we aim to facilitate the development of VLMs that can better understand and interpret culturally specific content, promoting greater inclusiveness beyond English-based corpora.

Paper Structure (26 sections, 2 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Illustration of the chain of thought for understanding a Chinese pun rebus. The example artwork uses a horse and a monkey to construct the pun "马上封侯" (mǎ shàng fēng hóu), which means "May you instantly become a marquis" in English.
  • Figure 2: An example data sample and the category distribution of the Pun Rebus Art Dataset. We offer both English and Chinese versions of the data annotation in the proposed dataset. The dataset querying system is available at http://niyibin.org/punrebus/punrebus_main_en.php.
  • Figure 3: Illustration of the three evaluation tasks using an 18th-century Chinese ceramic as an example. Bold marks the salient elements. Symbolic Matching probes the model's understanding of the artwork's implied meanings. Element Identification asks what catches the model's attention most in the artwork. Expression Understanding delves into the rationale behind the model's interpretations.
  • Figure 4: An example of expression understanding generated by GPT-4o and Gemini Pro, including the expert review and the expert-provided answer for this artwork. Errors are highlighted in red.
  • Figure 5: An example questionnaire presented to crowd-workers for one artwork image. The first question relates to the symbolic matching task, and the second question relates to the element identification task.
  • ...and 3 more figures