Multi-Agent Multimodal Models for Multicultural Text to Image Generation

Parth Bhalerao; Mounika Yalamarty; Brian Trinh; Oana Ignat

TL;DR

This paper addresses Western-centric bias in text-to-image generation by introducing MosAIG, a Multi-Agent Image Generation framework in which five LLM agents with distinct cultural personas produce culturally nuanced captions that drive image synthesis. It contributes a dataset of 9,000 multicultural person-landmark images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. Multi-agent interaction improves Alignment, Aesthetics, Quality, and Knowledge over simple, no-agent baselines, at some cost to Fairness. The approach uses AltDiffusion and FLUX as image generators, runs a structured agent pipeline with iterative QA, and is evaluated with both automated metrics and human judgments. The authors also outline next steps, including broader multilingual support, better evaluation, and extended demographic coverage, and discuss limitations and ethical considerations.

Abstract

Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://github.com/OanaIgnat/MosAIG.
