RusCode: Russian Cultural Code Benchmark for Text-to-Image Generation

Viacheslav Vasilev, Julia Agafonova, Nikolai Gerasimenko, Alexander Kapitanov, Polina Mikhailova, Evelina Mironova, Denis Dimitrov

TL;DR

The paper targets the cultural-awareness gap in text-to-image (T2I) generation, where English-centric biases can degrade quality and cause harm. It introduces RusCode, an expert-curated benchmark of 1250 complex prompts in Russian (with English translations) organized into 19 cultural categories, plus reference images. Four state-of-the-art T2I systems (Stable Diffusion 3, DALL-E 3, Kandinsky 3.1, YandexART 2) are evaluated through human studies, revealing persistent cultural-understanding gaps and differences between models. The authors find that CLIP-based metrics do not track human judgments, advocate culture-focused fine-tuning or retrieval-augmented generation, and release RusCode under the MIT license to spur further research and broader adoption.

Abstract

Text-to-image generation models have gained popularity among users around the world. However, many of these models exhibit a strong bias toward English-speaking cultures, ignoring or misrepresenting the unique characteristics of other language groups, countries, and nationalities. This lack of cultural awareness can reduce generation quality and lead to undesirable consequences such as unintentional insults and the spread of prejudice. Cultural awareness has been explored far less extensively in computer vision than in natural language processing. In this paper, we strive to narrow this gap. We propose the RusCode benchmark for evaluating the quality of text-to-image generation of content containing elements of the Russian cultural code. To do this, we form a list of 19 categories that best represent the features of Russian visual culture. Our final dataset consists of 1250 text prompts in Russian and their translations into English. The prompts cover a wide range of topics, including complex concepts from art, popular culture, folk traditions, famous people's names, natural objects, scientific achievements, etc. We present the results of a side-by-side human evaluation of how popular generative models represent Russian visual concepts.

Paper Structure

This paper contains 36 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: 19 categories of Russian cultural code in our RusCode benchmark dataset. The images are generated by the Kandinsky 3.1 model [arkhipkin2024kandinsky30technicalreport, vladimir-etal-2024-kandinsky].
  • Figure 2: Examples of prompts from the RusCode dataset with corresponding reference images.
  • Figure 3: The ratio of the number of collected prompts by each category in the RusCode dataset.
  • Figure 4: Comparison of Russian cultural code generations for popular text-to-image models. The reference is an example of a real image depicting a specific cultural concept from the RusCode dataset.
  • Figure 5: Human evaluation results of a side-by-side comparison between T2I model generations using text prompts from the RusCode dataset.
  • ...and 7 more figures