An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance

Simran Khanuja; Sathyanarayanan Ramamoorthy; Yueqi Song; Graham Neubig

TL;DR

This work tackles image transcreation: adapting images, not just text, so that they are culturally relevant to a target audience. It proposes three pipelines built from existing generative models: e2e-instruct (instruction-based image editing), cap-edit (caption the image, edit the caption with an LLM, then edit the image using the edited caption as the instruction), and cap-retrieve (use the LLM-edited caption to retrieve a natural image from a country-specific image dataset). The pipelines are evaluated on a new two-part dataset. The concept dataset (600 images across seven countries) tests cross-cultural substitution of a single concept per image, while the application dataset (100 images from education and literature) tests real-world use. Human evaluation shows that current image-editing models largely fail at culturally faithful transcreation, though adding LLMs and retrievers helps: the best pipelines succeed on only 5% of concept images for some countries, and on no application images for some countries. The released dataset and code provide a benchmark for future work on culturally aware multimodal translation.

Abstract

Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to perform the task. Next, we build a two-part evaluation dataset: i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image, and ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. The best pipelines can only translate 5% of images for some countries in the easier concept dataset, and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our code and data are released here: https://github.com/simran-khanuja/image-transcreation.
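
The three pipelines differ only in how they chain a captioner, an LLM, an image editor, and a retriever. The following is a minimal sketch of that control flow, not the authors' implementation: the `Models` container, every function name, the `Image` type alias, and the instruction wording are hypothetical placeholders, and the real models are assumed to be supplied by the caller as plain callables.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for an image handle (for example, a file path).
Image = str


@dataclass
class Models:
    """Caller-supplied components; all four fields are assumed interfaces."""
    edit_image: Callable[[Image, str], Image]  # (image, instruction) -> edited image
    caption: Callable[[Image], str]            # image -> caption
    edit_caption: Callable[[str, str], str]    # (caption, country) -> adapted caption
    retrieve: Callable[[str, str], Image]      # (query, country) -> natural image


def e2e_instruct(m: Models, image: Image, country: str) -> Image:
    # Edit the image directly from a natural language instruction.
    # The exact instruction wording here is an assumption.
    instruction = f"make this image culturally relevant to {country}"
    return m.edit_image(image, instruction)


def cap_edit(m: Models, image: Image, country: str) -> Image:
    # Caption the image, let the LLM adapt the caption, then edit the
    # original image using the adapted caption as the instruction.
    adapted = m.edit_caption(m.caption(image), country)
    return m.edit_image(image, adapted)


def cap_retrieve(m: Models, image: Image, country: str) -> Image:
    # Same captioning and LLM step, but retrieve a natural image from a
    # country-specific collection instead of editing the original.
    adapted = m.edit_caption(m.caption(image), country)
    return m.retrieve(adapted, country)
```

Keeping the components as injected callables means any captioner, LLM, editor, or retriever can be swapped in without changing the pipeline logic.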

Paper Structure (39 sections, 26 figures, 1 table)

Introduction
Pipelines for Image Transcreation
e2e-instruct: Instruction-based editing
cap-edit: Caption, text-edit, image-edit
cap-retrieve: Caption, edit, retrieve
Evaluation Dataset
Concept dataset
Application dataset
Why the two-part dataset?
Human Evaluation and Quantitative Metrics
Questions and Findings: Concept
Questions and Findings: Application
Quantitative Metrics
Related Work
Conclusion
...and 24 more sections

Figures (26)

Figure 1: Image transcreation as done in various applications today: a) Audiovisual (AV) media: where several changes were made to adapt Doraemon to the US context like adding crosses and Fs in grade sheets, or in Inside Out, where broccoli is replaced with bell peppers in Japan as a vegetable that children don't like; b) Education: where the same concepts are taught differently in different countries, using local currencies or celebration-themed worksheets; c) Advertisements: where the same product is packaged and marketed differently, like in Ferrero Rocher taking the shape of a lunar festival kite in China, and that of a Christmas tree elsewhere.
Figure 2: Pipelines to transcreate images: e2e-instruct takes as input the original image and a natural language instruction; cap-edit first captions the image, uses an LLM to edit the caption for cultural relevance, and edits the original image using the LLM-edit as instruction; and cap-retrieve uses this LLM-edit to retrieve a natural image from a country-specific image dataset. Given the unprecedented nature of this task, we create pipelines using pre-existing SOTA models, and benchmark them on our newly created test set.
Figure 3: Concept dataset: We select seven geographically diverse countries and universal categories that are cross-culturally comprehensive. Annotators native to selected countries give us 5 concepts and associated images that are culturally salient for the speaking population of their country.
Figure 4: Story text: My mom bought rice.
Figure 5: Human ratings for the concept dataset: Our primary goal is to test whether the edited image belongs to the same universal category as the original image (C1) and whether it increases cultural relevance (C3). We plot the count of images that satisfy both (C1+C3), and observe that the best pipeline's performance ranges from 5% (Nigeria) to 30% (Japan).
...and 21 more figures
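
The C1+C3 tally described in the Figure 5 caption can be sketched as a per-country success rate. This is an illustrative sketch only: the `Rating` record and the `success_rate` function are hypothetical names, and it assumes each human judgment has already been reduced to two booleans per image (how raw ratings are thresholded is not specified here).

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Rating(NamedTuple):
    """One evaluated image; field names are hypothetical."""
    country: str
    same_category: bool  # C1: edited image stays in the same universal category
    more_relevant: bool  # C3: edited image is more culturally relevant


def success_rate(ratings: Iterable[Rating]) -> Dict[str, float]:
    """Fraction of images per country that satisfy both C1 and C3."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for r in ratings:
        totals[r.country] += 1
        if r.same_category and r.more_relevant:
            hits[r.country] += 1
    return {country: hits[country] / totals[country] for country in totals}
```

Requiring both conditions matters: an edit that raises cultural relevance but changes the category (C3 without C1) does not count as a successful transcreation.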
