Multi-Agent Multimodal Models for Multicultural Text to Image Generation

Parth Bhalerao, Mounika Yalamarty, Brian Trinh, Oana Ignat

TL;DR

This paper tackles the problem of Western-centric bias in text-to-image generation by introducing MosAIG, a Multi-Agent Image Generation framework that leverages diverse cultural personas via five LLM agents to produce culturally nuanced captions that drive image synthesis. It provides a new dataset of 9,000 multicultural person-landmark scenes across five countries, three age groups, two genders, 25 landmarks, and five languages, and demonstrates that multi-agent interactions yield improvements in Alignment, Aesthetics, Quality, and Knowledge compared to simple baselines, though at a cost to Fairness. The approach combines AltDiffusion and FLUX as image generators, uses a structured agent pipeline with iterative QA, and evaluates through automated metrics and human judgments, revealing actionable insights for future cross-cultural AI systems. The work emphasizes the practical impact of richer cultural representation in generated imagery and offers concrete steps toward broader multilingual support, better evaluation, and extended demographic coverage, while acknowledging limitations and ethical considerations.

Abstract

Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://github.com/OanaIgnat/MosAIG.

Paper Structure

This paper contains 40 sections, 2 equations, 47 figures, 2 tables.

Figures (47)

  • Figure 1: Most datasets used for training are dominated by singular cultural contexts (e.g., "Golden Gate Bridge" primarily depicted with American visitors or as a standalone monument). However, real-world scenarios often transcend cultural boundaries, with people from various backgrounds sharing spaces and experiences. Including images that combine multiple cultures, gender and age groups in a single scene allows models to develop a richer, more nuanced understanding of the world.
  • Figure 2: Overview of MosAIG, our framework for Multi-Agent Image Generation. The framework includes a multi-agent interaction model that generates an image caption from demographic information (person age, gender, country, landmark, and caption language), which is then used by an image generation model to create a multicultural image of a landmark and a person.
  • Figure 3: Our multi-agent models (Alt-En-M and Flux-M) surpass simple models (Alt-En-S and Flux-S) on Alignment, Aesthetics, Quality, and Knowledge while performing worse in Fairness. For ease of comparison, all the scores are normalized to a 0–1 scale. Higher scores are better for Alignment, Aesthetics, Quality, and Knowledge, while lower scores are better for Fairness.
  • Figure 4: Ablation studies on (a) person age, (b) person gender, (c) person country, (d) landmark country, and (e) caption language, using the best overall model, the Multi-agent English Flux-M, for (a-d) and the Multi-agent Multilingual Alt-M for (e). Performance across all five metrics (Alignment, Aesthetics, Quality, Knowledge, and Fairness) reveals significant variation across these demographic categories.
  • Figure 5: Alignment scores with the best overall model, Flux-M, over person and landmark country (left) and gender and age (right).
  • ...and 42 more figures
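
The score convention in the Figure 3 caption (all scores on a 0–1 scale, higher is better for every metric except Fairness) can be illustrated with a minimal Python sketch. The caption does not say how the scores were normalized, so min-max scaling is an assumption here, and the helper names `min_max_normalize` and `better_model` are hypothetical rather than taken from the MosAIG code.

```python
from typing import Dict

# Metric directions as stated in the Figure 3 caption.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "Alignment": True,
    "Aesthetics": True,
    "Quality": True,
    "Knowledge": True,
    "Fairness": False,  # lower scores are better for Fairness
}


def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Map a raw score from [lo, hi] onto [0, 1].

    Assumption: min-max scaling; the paper's actual procedure may differ.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (value - lo) / (hi - lo)


def better_model(metric: str, score_a: float, score_b: float) -> str:
    """Return 'a', 'b', or 'tie' for two normalized scores on one metric."""
    if score_a == score_b:
        return "tie"
    # Model a wins when it is higher on a higher-is-better metric,
    # or lower on a lower-is-better metric.
    a_wins = (score_a > score_b) == HIGHER_IS_BETTER[metric]
    return "a" if a_wins else "b"
```

Any scores passed to these helpers are illustrative inputs, not results from the paper. The point of the direction table is that a single comparison routine can treat Fairness correctly without special-casing it at each call site.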