Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion

Jiabao Shi, Minfeng Qi, Lefeng Zhang, Di Wang, Yingjie Zhao, Ziying Li, Yalong Xing, Ningran Li

TL;DR

The paper tackles the gap between broad generalization and domain-specific precision in text-to-image generation by introducing a modular, multi-agent reinforcement learning framework with domain-specialized agents for architecture, portrait, and landscape. It couples a text enhancement module and an image generation module under PPO optimization, leveraging cross-modal contrastive learning, bidirectional attention, and iterative text-image feedback to align semantics across modalities. Key contributions include a specialized agent architecture, PPO-based cross-modal optimization with a composite reward, and a suite of advanced fusion strategies (notably Transformer-based fusion) along with a unified evaluation framework capturing text quality, image fidelity, and cross-modal consistency. The findings show that specialization coupled with learned coordination can enrich content and domain-specific fidelity, with Transformer fusion delivering the best practical balance, though reinforcement learning in a multi-agent, cross-modal setting remains challenging due to non-stationarity and evaluation gaps.

Abstract

Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents (e.g., focused on architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic visual quality, and content diversity. Cross-modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between text and image. Across six experimental settings, our system significantly enriches generated content (word count increased by 1614%) while reducing ROUGE-1 scores by 69.7%. Among fusion methods, Transformer-based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (ranging from 0.444 to 0.481), reflecting the persistent challenges of cross-modal semantic grounding. These findings underscore the promise of collaborative, specialization-driven architectures for advancing reliable multimodal generative systems.
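The abstract names three reward components (semantic similarity, linguistic visual quality, and content diversity) but this summary does not say how they are combined. As a minimal sketch, assuming a normalized weighted sum over scores in [0, 1], such a composite reward could look like the following. The names `RewardWeights` and `composite_reward` and the weight values are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the source does not state the actual values.
    semantic: float = 0.5
    quality: float = 0.3
    diversity: float = 0.2


def composite_reward(semantic_similarity: float,
                     quality: float,
                     diversity: float,
                     weights: RewardWeights = RewardWeights()) -> float:
    """Combine three component scores into one scalar reward.

    Assumption: every component score is already scaled to [0, 1].
    The result is a weighted average, so it also lies in [0, 1].
    """
    components = {
        "semantic_similarity": semantic_similarity,
        "quality": quality,
        "diversity": diversity,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    total_weight = weights.semantic + weights.quality + weights.diversity
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive value")

    weighted_sum = (weights.semantic * semantic_similarity
                    + weights.quality * quality
                    + weights.diversity * diversity)
    # Dividing by the weight total keeps the reward scale fixed
    # even if the weights are later re-tuned.
    return weighted_sum / total_weight
```

A scalar of this kind would be the per-sample reward handed to a PPO update. Keeping it normalized is one design choice that avoids rescaling the advantage estimates whenever the weights change.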

Paper Structure

This paper contains 39 sections, 4 equations, 12 figures, 5 tables, 3 algorithms.

Figures (12)

  • Figure 1: Multi-agent Text-to-Image Generation Model framework diagram. The system comprises three main modules: Text Optimization Module (left) with foundation models and enhancement agents, Image Generation Module (bottom) with multi-method fusion strategies, and Multimodal Integration and Consistency Evaluation Module (right).
  • Figure 2: Text Processing Module with Multi-Agent Enhancement
  • Figure 3: Multi-Agent Image Generation with Fusion Methods
  • Figure 4: Multimodal Integration and Consistency Evaluation
  • Figure 5: Single-Agent vs Multi-Agent Text Generation
  • ...and 7 more figures