Culture-TRIP: Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement

Suchae Jeong, Inseong Choi, Youngsik Yun, Jihie Kim

TL;DR

Culture-TRIP addresses the misalignment of culture nouns in text-to-image generation by grounding prompts in retrieved cultural context and visual details and then iteratively refining them with large language models. The method retrieves knowledge from Wikipedia and the Web, scores each refinement against cultural-context and visual-detail criteria, and stops when a score threshold is reached or after five iterations. Across eight countries and 25 culture nouns per country, human judges and automatic metrics show that the approach substantially improves alignment for underrepresented culture nouns, with the strongest gains observed for the fully refined C-TRIP_5 configuration. The work demonstrates that knowledge-grounded, iterative prompt refinement can reduce cultural bias in text-to-image generation without model fine-tuning, offering a scalable path toward more accurate and respectful cultural representations.

Abstract

Text-to-Image models, including Stable Diffusion, have significantly improved in generating images that are highly semantically aligned with the given prompts. However, existing models may fail to produce appropriate images for cultural concepts or objects that are not well known or are underrepresented in Western cultures, such as 'hangari' (Korean utensil). In this paper, we propose a novel approach, Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement (Culture-TRIP), which refines the prompt in order to improve the alignment of the image with such culture nouns in text-to-image models. Our approach (1) retrieves cultural contexts and visual details related to the culture nouns in the prompt and (2) iteratively refines and evaluates the prompt based on a set of cultural criteria and large language models. The refinement process utilizes the information retrieved from Wikipedia and the Web. Our user survey, conducted with 66 participants from eight different countries, demonstrates that our proposed approach enhances the alignment between the images and the prompts. In particular, C-TRIP demonstrates improved alignment between the generated images and underrepresented culture nouns. Resources can be found at https://shane3606.github.io/Culture-TRIP.
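
The retrieve, score, and refine loop summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `RefinementResult` type, the threshold value, and the choice to score the base prompt before the first refinement are all hypothetical, and the toy retrieval, scoring, and refinement functions merely stand in for the Wikipedia/Web retrieval and the LLM calls. Only the stopping rule (a score threshold or at most five iterations) comes from the summary.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_ITERATIONS = 5  # from the summary: refinement stops after five iterations


@dataclass
class RefinementResult:
    prompt: str       # final (possibly refined) prompt
    score: float      # score of the final prompt
    iterations: int   # number of refinement steps actually applied


def refine_prompt(
    base_prompt: str,
    culture_noun: str,
    retrieve: Callable[[str], List[str]],
    score: Callable[[str, List[str]], float],
    refine: Callable[[str, List[str]], str],
    threshold: float,
    max_iterations: int = MAX_ITERATIONS,
) -> RefinementResult:
    """Refine until the score reaches `threshold` or the iteration cap is hit."""
    knowledge = retrieve(culture_noun)      # cultural context and visual details
    prompt = base_prompt
    current = score(prompt, knowledge)      # assumption: the base prompt is scored first
    iterations = 0
    while current < threshold and iterations < max_iterations:
        prompt = refine(prompt, knowledge)  # stand-in for an LLM refinement call
        current = score(prompt, knowledge)  # stand-in for an LLM evaluation call
        iterations += 1
    return RefinementResult(prompt, current, iterations)


# Toy stand-ins (hypothetical): hand-written strings instead of retrieved text.
TOY_FACTS = {"hangari": ["Korean earthenware jar", "used for fermenting food"]}


def toy_retrieve(noun: str) -> List[str]:
    return TOY_FACTS.get(noun, [])


def toy_score(prompt: str, facts: List[str]) -> float:
    # Fraction of known facts already mentioned in the prompt.
    if not facts:
        return 1.0
    return sum(fact in prompt for fact in facts) / len(facts)


def toy_refine(prompt: str, facts: List[str]) -> str:
    # Append the first fact that the prompt does not mention yet.
    for fact in facts:
        if fact not in prompt:
            return f"{prompt}, {fact}"
    return prompt


result = refine_prompt(
    "A photo of hangari", "hangari",
    toy_retrieve, toy_score, toy_refine, threshold=1.0,
)
print(result)
```

Passing the three steps in as callables keeps the loop independent of any particular retrieval backend or language model, so the stopping rule can be exercised with simple stubs like the ones above.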

Paper Structure

This paper contains 40 sections, 3 equations, 23 figures, 14 tables.

Figures (23)

  • Figure 1: Comparison between Stable Diffusion with and without our proposed approach, C-TRIP. (a) shows an image of a hangari from Wikipedia. (b) is an image generated by Stable Diffusion 2, while (c) shows an image generated with our approach. The additional knowledge about hangari (highlighted in red) in (c) helps the model generate an image that more closely resembles the actual hangari.
  • Figure 2: C-TRIP Overview. First, we retrieve cultural contexts (cultural background, purpose) and visual details related to the culture nouns, as described in Section \ref{sec:retrieve}. Then, we refine the prompt based on the obtained information, iteratively evaluating and refining it as described in Section \ref{sec:refine}.
  • Figure 3: Qualitative comparison of ablated C-TRIP configurations against the Base Prompt. The six columns can be divided into two groups: relatively UC nouns (left four columns) and RC nouns (right two columns). For the left group, C-TRIP had to introduce culture nouns that were underrepresented in Text-to-Image models, while for the right group, the additional information helped the models recall what they already knew.
  • Figure 4: A box plot illustrating the normalized improvement scores for each group (Q1, Q2, Q3, and Q4). A score exceeding 0.45 signifies that C-TRIP's guidelines enhance the image alignment of the Stable Diffusion 2 model. Notably, the Q1 group exhibits the highest performance improvement of all the groups.
  • Figure 5: Country Distribution in Q1 group.
  • ...and 18 more figures