Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion
Jiabao Shi, Minfeng Qi, Lefeng Zhang, Di Wang, Yingjie Zhao, Ziying Li, Yalong Xing, Ningran Li
TL;DR
The paper tackles the gap between broad generalization and domain-specific precision in text-to-image generation by introducing a modular, multi-agent reinforcement learning framework with domain-specialized agents for architecture, portraiture, and landscape imagery. It couples a text enhancement module and an image generation module under PPO optimization, using cross-modal contrastive learning, bidirectional attention, and iterative text-image feedback to align semantics across modalities. Key contributions include a specialized agent architecture, PPO-based cross-modal optimization with a composite reward, a suite of advanced fusion strategies (notably Transformer-based fusion), and a unified evaluation framework covering text quality, image fidelity, and cross-modal consistency. The findings show that specialization coupled with learned coordination can enrich generated content and improve domain-specific fidelity, with Transformer fusion delivering the best practical balance. Reinforcement learning in a multi-agent, cross-modal setting nonetheless remains challenging because of non-stationarity and evaluation gaps.
Abstract
Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents (e.g., focused on architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic and visual quality, and content diversity. Cross-modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between text and image. Across six experimental settings, our system substantially enriches generated content (word count increases by 1614%), while ROUGE-1 scores drop by 69.7%. Among fusion methods, Transformer-based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (0.444 to 0.481), reflecting the persistent challenges of cross-modal semantic grounding. These findings underscore the promise of collaborative, specialization-driven architectures for advancing reliable multimodal generative systems.
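To make the idea of a composite reward concrete, the following is a minimal sketch of one way such a reward could be assembled as a weighted sum of a semantic-similarity term, a quality term, and a diversity term. It is an illustration under stated assumptions, not the authors' implementation: the weights, the cosine-similarity and distinct-token proxies, and all names (`RewardWeights`, `composite_reward`, `distinct_ratio`) are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not state the values used.
    semantic: float = 0.5
    quality: float = 0.3
    diversity: float = 0.2


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def distinct_ratio(tokens):
    """Share of unique tokens, used here as an assumed stand-in for content diversity."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def composite_reward(text_emb, image_emb, quality, tokens, w=RewardWeights()):
    """Weighted sum of three terms, each scaled to [0, 1]."""
    # Map cosine similarity from [-1, 1] to [0, 1] so all terms share a scale.
    semantic = (cosine_similarity(text_emb, image_emb) + 1.0) / 2.0
    # Assume the quality score comes from an external scorer; clamp it to [0, 1].
    quality = min(max(quality, 0.0), 1.0)
    diversity = distinct_ratio(tokens)
    return w.semantic * semantic + w.quality * quality + w.diversity * diversity
```

In a PPO training loop, a scalar of this kind would be the per-episode reward returned to each agent. Keeping every term on the same [0, 1] scale means the weights directly express the trade-off between alignment, quality, and diversity.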
