Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts

Michael Saxon, Yiran Luo, Sharon Levy, Chitta Baral, Yezhou Yang, William Yang Wang

TL;DR

This paper shows that translation errors in the CoCo-CroLa benchmark can produce false negatives when assessing multilingual text-to-image (T2I) models. It introduces a text-domain similarity metric, $\Delta\mathrm{SEM}$, and a correction framework that quantify how a change in a concept's translation affects the image-domain correctness metric $X_c$ and its change $\Delta X_c$. The analysis is supported by human-verified corrections for JA, ZH, and ES and by a pseudocorrection study. Across multiple diffusion models, a larger $\Delta\mathrm{SEM}$ strongly predicts a larger $\Delta X_c$, especially for AltDiffusion, suggesting that textual alignment drives visual outcomes. The work releases v1.1 corrections, analyzes practical translation decisions, and offers guidance for designing robust multilingual T2I benchmarks that account for translation variation and perceptual blind spots.

Abstract

Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction's impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions.

Paper Structure

This paper contains 25 sections, 4 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: The CoCo-CroLa benchmark mistranslated concepts such as bike in JA and suit in ZH. With correct translations (right) AltDiffusion does in fact "possess" them; originally (left) they were false negatives.
  • Figure 2: Scatterplots showing the impact of the corrections to each concept in JA, ZH, and ES on the conceptwise improvement to the CCCL correctness score, $\Delta X_c$, as a function of $\Delta\mathrm{SEM}$. Slopes $m$ at bottom-right in bold.
  • Figure 3: Languages with a high correlation between textual correction significance and image improvement (PCC) are more "well-understood" by the model ($X_c$), for both real- and pseudo-corrections.
  • Figure 4: Histograms of the error counts in JA, ZH, and ES vs. $\Delta\mathrm{SEM}$, colored by error type. From lightest, the types are F: formality, C: commonality, A: ambiguity, T: transliteration, IS: incoming sense error, OS: outgoing sense error. The error types are defined in \ref{sec:errtype}. More severe error types exhibit more rightward distributional mass.
  • Figure 5: Qualitative examples of selected mistranslated concepts found in CoCo-CroLa, generated by AltDiffusion and multiple versions of Stable Diffusion. Top left: "Rock" in Japanese; top right: "Suit" in Chinese; bottom left: "Tent" in Spanish; bottom right: "Table" in Chinese. Notably, T2I models such as Stable Diffusion 2 do not benefit from the corrected translations: their outputs in these languages remain irrelevant, much as they would be for random prompts.
  • ...and 1 more figures
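
The captions of Figures 2 and 3 refer to a slope $m$ and a Pearson correlation coefficient (PCC) relating $\Delta\mathrm{SEM}$ to $\Delta X_c$. As a minimal sketch of how such statistics could be computed from per-concept pairs, assuming ordinary least squares for the slope (the summary does not state the fitting procedure), one might write the following. The function name `slope_and_pcc` and the input values are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def slope_and_pcc(delta_sem, delta_xc):
    """Return (OLS slope, Pearson correlation) for paired per-concept values.

    delta_sem: text-domain change scores, one per corrected concept
    delta_xc:  image-domain correctness changes, in the same concept order
    Both the name and the OLS choice are assumptions made for illustration.
    """
    n = len(delta_sem)
    if n != len(delta_xc) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_x = sum(delta_sem) / n
    mean_y = sum(delta_xc) / n

    # Centered sums of squares and cross-products.
    sxx = sum((x - mean_x) ** 2 for x in delta_sem)
    syy = sum((y - mean_y) ** 2 for y in delta_xc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(delta_sem, delta_xc))

    if sxx == 0 or syy == 0:
        raise ValueError("slope and PCC are undefined when either input is constant")

    slope = sxy / sxx            # least-squares slope of delta_xc on delta_sem
    pcc = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, pcc


if __name__ == "__main__":
    # Made-up placeholder values, NOT data from the paper.
    toy_delta_sem = [0.0, 1.0, 2.0]
    toy_delta_xc = [1.0, 3.0, 5.0]
    m, r = slope_and_pcc(toy_delta_sem, toy_delta_xc)
    print(f"slope m = {m:.3f}, PCC = {r:.3f}")
```

Under this reading, a steep positive slope together with a high PCC for a given model and language would correspond to the pattern the captions describe: corrections that change the text more also change the generated images more.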