Semantic and Expressive Variation in Image Captions Across Languages

Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna

TL;DR

The paper demonstrates that image captions vary semantically and expressively across languages, challenging the assumption of homogeneous cross-cultural perception in vision-language tasks. By analyzing seven languages on the Crossmodal dataset and translating captions to English for fair comparison, the authors show that multilingual captions cover more content and exhibit richer expressive variation than monolingual ones. They extend these findings to model outputs (LLaVA and Vertex API) and reveal that multilingual fine-tuning yields broader cross-language robustness, suggesting multilingual data can enhance vision-language representations. The work advocates embracing linguistic and cultural diversity in dataset construction and model training to achieve more robust, inclusive vision systems.

Abstract

Computer vision often treats human perception as homogeneous: an implicit assumption that visual stimuli are perceived similarly by everyone. This assumption is reflected in the way researchers collect datasets and train vision models. By contrast, literature in cross-cultural psychology and linguistics has provided evidence that people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli. In this paper, we study how these differences manifest themselves in vision-language datasets and models, using language as a proxy for culture. By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression. When datasets are multilingual as opposed to monolingual, descriptions have higher semantic coverage on average, where coverage is measured using scene graphs, model embeddings, and linguistic taxonomies. For example, multilingual descriptions have on average 29.9% more objects, 24.5% more relations, and 46.0% more attributes than a set of monolingual captions. When prompted to describe images in different languages, popular models (e.g. LLaVA) inherit this bias and describe different parts of the image. Moreover, finetuning models on captions from one language performs best on corresponding test data from that language, while finetuning on multilingual data performs consistently well across all test data compositions. Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.

Paper Structure (25 sections, 7 figures, 16 tables)

Figures (7)

  • Figure 1: People speaking different languages may caption images differently, noticing and emphasizing different aspects of the image. These examples are drawn from our user study. In this paper, we demonstrate that there are distributional differences between the concepts represented in different languages, in addition to the general variation in annotator subjectivity/noise. Illustrative example.
  • Figure 2: Semantic content evaluation. Captions of an image in different languages and their scene graphs, when unioned together, produce more varied and complex scene graphs for multilingual distributions than for monolingual ones. Captions from Vertex.
  • Figure 3: Instructions and examples presented to human evaluation participants for image captioning.
  • Figure 4: Scene graphs of captions unioned cumulatively from different languages lead to more coverage in objects, relations, and attributes.
  • Figure 5: Sample scene graphs across six images. "lang-$n$" indicates the scene graph generated for the $n$th caption in lang. "lang1-lang2-lang3" indicates the scene graph unioned from three scene graphs originally from each of the three languages.
  • ...and 2 more figures
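
The scene-graph union mentioned in the captions of Figures 2, 4, and 5, and the "more objects, relations, and attributes" comparison in the abstract, can be illustrated with a small sketch. This is not the authors' implementation: the SceneGraph class, the helper names, and the toy graphs are all hypothetical, and the union here is a naive set union over exact string matches, whereas a real pipeline would likely need to merge synonyms and parse captions into graphs first.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class SceneGraph:
    """Hypothetical minimal scene graph: sets of objects, relations, attributes."""
    objects: FrozenSet[str] = frozenset()
    relations: FrozenSet[Tuple[str, str, str]] = frozenset()  # (subject, predicate, object)
    attributes: FrozenSet[Tuple[str, str]] = frozenset()      # (object, attribute)

    def __or__(self, other: "SceneGraph") -> "SceneGraph":
        # Naive union: identical strings are merged, everything else is kept.
        return SceneGraph(
            self.objects | other.objects,
            self.relations | other.relations,
            self.attributes | other.attributes,
        )

    def counts(self) -> Dict[str, int]:
        return {
            "objects": len(self.objects),
            "relations": len(self.relations),
            "attributes": len(self.attributes),
        }


def cumulative_counts(graphs: Iterable[SceneGraph]) -> List[Dict[str, int]]:
    """Counts after unioning the first 1, 2, ... graphs, in the given order."""
    running, history = SceneGraph(), []
    for graph in graphs:
        running = running | graph
        history.append(running.counts())
    return history


def relative_increase(multi: SceneGraph, mono: SceneGraph) -> Dict[str, float]:
    """Fractional gain of the multilingual union over the monolingual union."""
    m, b = multi.counts(), mono.counts()
    return {key: (m[key] - b[key]) / b[key] for key in b}


# Toy data, invented for illustration only (not taken from the paper).
en_1 = SceneGraph(frozenset({"dog", "grass"}),
                  frozenset({("dog", "on", "grass")}),
                  frozenset({("dog", "brown")}))
en_2 = SceneGraph(frozenset({"dog", "ball"}),
                  frozenset({("dog", "chasing", "ball")}),
                  frozenset({("dog", "brown")}))
ja_1 = SceneGraph(frozenset({"dog", "park", "tree"}),
                  frozenset({("dog", "in", "park")}),
                  frozenset({("tree", "tall"), ("dog", "small")}))

mono = en_1 | en_2    # two captions from the same language
multi = en_1 | ja_1   # two captions from different languages
gain = relative_increase(multi, mono)
history = cumulative_counts([en_1, ja_1])
```

Both unions are built from the same number of captions, so any difference in counts reflects the language mix and not the amount of text. Multiplying a value in gain by 100 gives a percentage of the kind quoted in the abstract, although the toy numbers here carry no empirical meaning.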