A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling

Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

TL;DR

This work addresses perceptual diversity across languages in vision-language modeling, targeting the English-centric bias introduced when training data are machine-translated from English captions. It introduces a multimodal recaptioning framework that uses a small native-speaker reference set and nearest-neighbor image guidance to rewrite English captions so that they better reflect how objects are described in a target language, then adds these rewrites to multilingual CLIP (mCLIP) finetuning. On German and Japanese text-image retrieval benchmarks, targeted recaptioning improves retrieval (up to +3.5 mean recall; up to +4.4 on native vs. translation error sets) and generalizes across datasets. The work also analyzes cross-language differences in object descriptions using WordNet-based taxonomies, which reveals language-specific term distributions and can inform future multilingual data collection.

Abstract

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can differ especially across languages and cultures. Modern vision-language models (VLMs) often gain understanding of images paired with text in different languages through training on machine translations of English captions. However, this process relies on input content written from the perception of English speakers, leading to a perceptual bias. In this work, we outline a framework to address this bias. Specifically, we use a small amount of native speaker data, nearest-neighbor example guidance, and multimodal LLM reasoning to augment captions so that they better reflect descriptions in a target language. When adding the resulting rewrites to multilingual CLIP finetuning, we improve on German and Japanese text-image retrieval case studies (up to +3.5 mean recall, +4.4 on native vs. translation errors). We also propose a mechanism to build understanding of object description variation across languages, and offer insights into cross-dataset and cross-language generalization.

Paper Structure

This paper contains 22 sections, 3 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: English captions (and their translations) do not capture the perceptual diversity of object and scene descriptions in other languages. They often fail to include cultural terms (e.g., bento box) and miss differences in native perspective (e.g., German emphasis on American football). More subtly, we find that they differ from cross-language captions in the use of common nouns, for instance in Japanese STAIR (Yoshikawa et al., 2017), where bread is described more frequently, especially with its contents (e.g., vegetables). Our multimodal recaptioning method considers these differences to enhance cross-lingual training data generation.
  • Figure 2: Our multimodal, LLM-based recaptioning method to adapt object descriptions before translation. For a set of images with only English captions, we generate new captions which better represent perceptual properties of a target language (e.g. Japanese). Each generation is guided by a reference example selected as the nearest neighbor in image similarity from a small set of native speaker data. Using the prompt shown, the multimodal LLM leverages the reference example and image context to infer targeted changes. This example shows the model adding the cultural term bento while also listing foods relevant to the input image. Text in brackets is not in the prompt.
  • Figure 3: Guidance from nearest neighbors can reveal subtle differences in object naming. A reference is chosen based on image similarity to acquire diverse descriptions of related concepts and scenes. Notice how truck may be described loosely as car in Japanese. Similarly, there are differences in object grouping and objects that are deemed salient (described).
  • Figure 4: When comparing English COCO vs. Japanese STAIR captions, object term distributions are found to vary across languages. For each supercategory, any term with count $> 150$ is identified, and the union of terms across languages is shown. Note unique variation across common objects (e.g. counter, furniture, bread, sunglasses).
  • Figure 5: Image-to-text (I2T) retrievals that Targeted Image Recaptioning gets correct but default finetuning gets incorrect within the top 10 results. The targeted method addresses unique differences in perspective and level of detail.
  • ...and 7 more figures
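
The reference selection described in the captions for Figures 2 and 3 (choosing, for each English-captioned image, the native-speaker example whose image is most similar) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that images are already represented as embedding vectors and that similarity is measured by cosine similarity, neither of which is specified in the text above. The names `cosine_similarity`, `select_reference`, and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def select_reference(
    query_embedding: Sequence[float],
    reference_set: List[Tuple[Sequence[float], str]],
) -> str:
    """Return the native-speaker caption whose image embedding is nearest
    to the query image embedding.

    `reference_set` is a hypothetical list of (image_embedding, native_caption)
    pairs standing in for the small native-speaker reference set.
    """
    if not reference_set:
        raise ValueError("reference_set must contain at least one example")
    best_embedding, best_caption = max(
        reference_set,
        key=lambda item: cosine_similarity(query_embedding, item[0]),
    )
    return best_caption


# Toy data: 3-dimensional placeholders, not real image features.
toy_references = [
    ([1.0, 0.0, 0.0], "native caption A"),
    ([0.0, 1.0, 0.0], "native caption B"),
]
toy_query = [0.9, 0.1, 0.0]
selected = select_reference(toy_query, toy_references)
```

In a pipeline of the kind the captions describe, the selected caption would then be passed, together with the input image and its English caption, to the multimodal LLM prompt. That step is omitted here because the prompt itself is only shown in Figure 2.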