Exploring Visual Culture Awareness in GPT-4V: A Comprehensive Probing

Yong Cao, Wenyan Li, Jiaang Li, Yifei Yuan, Antonia Karamolegkou, Daniel Hershcovich

TL;DR

This paper systematically probes GPT-4V's visual cultural awareness using the MaRVL benchmark through three tasks: caption classification, pairwise captioning, and culture tag selection. It investigates alignment between language and vision, extraction of fine-grained cultural features, and cross-cultural knowledge from visual input, using both automated metrics and human evaluation. Findings show GPT-4V generally excels in cultural concept identification and generates culturally richer descriptions than the MaRVL annotations, though performance drops for low-resource languages. The work highlights GPT-4V's potential for improving cultural benchmarks and datasets, while also identifying language gaps and outlining directions for future benchmark construction and cross-cultural evaluation.

Abstract

Pretrained large vision-language models have drawn considerable interest in recent years due to their remarkable performance. Despite considerable efforts to assess these models from diverse perspectives, the extent of visual cultural awareness in the state-of-the-art GPT-4V model remains unexplored. To address this gap, we extensively probed GPT-4V using the MaRVL benchmark dataset, aiming to investigate its capabilities and limitations in visual understanding with a focus on cultural aspects. Specifically, we introduced three vision-related tasks, namely caption classification, pairwise captioning, and culture tag selection, to enable a systematic, fine-grained evaluation of visual cultural awareness. Experimental results indicate that GPT-4V excels at identifying cultural concepts but still exhibits weaker performance in low-resource languages, such as Tamil and Swahili. Notably, human evaluation shows that GPT-4V's image captions are more culturally relevant than the original MaRVL human annotations, suggesting a promising solution for future visual cultural benchmark construction.

Paper Structure

This paper contains 21 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Examples of captions from human annotation and from GPT-4V, where GPT-4V's description is more culturally relevant than the original human annotation, both in grasping the cultural concept and in capturing fine-grained cultural aspects.
  • Figure 2: Our cultural probing framework, which includes caption classification, pairwise captioning, and culture tag selection.
  • Figure 3: Reference-based evaluation comparison in the cultural pairwise captioning task.
  • Figure 4: Our proposed platform for human evaluation in the cultural pairwise captioning task.
  • Figure 5: Additional case studies in the cultural pairwise caption generation task across languages, with culturally relevant descriptions highlighted in blue.