Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

Hossein Mohebbi, Mohammed Abdulrahman, Yanting Miao, Pascal Poupart, Suraj Kothawade

TL;DR

Image-POSER is a reflective reinforcement learning (RL) framework that orchestrates a heterogeneous pool of text-to-image (T2I) and image-to-image (I2I) experts to execute long-form prompts. It couples a DQN-based orchestration policy with a vision-language model (VLM) critic and an LLM-powered command extractor, enabling dynamic task decomposition, retries, and adaptive reordering of expert calls. Empirical results show consistent improvements in alignment, fidelity, and aesthetics over strong baselines, and human evaluators prefer Image-POSER on both generation and editing tasks. The work shows that planning, critique, and refinement via reinforcement learning can support general-purpose visual assistants without retraining the base generators, though cost, bias, and safety remain open considerations.

Abstract

Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.
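
To make the MDP framing concrete, the following is a minimal sketch of an orchestration loop, not the authors' implementation. It makes several simplifying assumptions: tabular Q-learning stands in for the DQN, the VLM critic is replaced by a random stub with fixed success odds, the state is just the current sub-task, and the expert names, sub-task names, and failure penalty are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical expert registry; these names are placeholders, not the paper's models.
EXPERTS = ["t2i_expert_a", "t2i_expert_b", "i2i_expert_c"]

# Toy stand-in for the VLM critic: assumed odds that each expert completes each sub-task.
SUCCESS_PROB = {
    ("generate_scene", "t2i_expert_a"): 0.9,
    ("generate_scene", "t2i_expert_b"): 0.3,
    ("generate_scene", "i2i_expert_c"): 0.0,
    ("add_object", "t2i_expert_a"): 0.0,
    ("add_object", "t2i_expert_b"): 0.1,
    ("add_object", "i2i_expert_c"): 0.9,
}

FAIL_PENALTY = -0.5  # assumed cost of a rejected step; not taken from the paper


def critic_reward(task, expert, rng):
    """Simulated critic: 1.0 if the step is accepted, FAIL_PENALTY otherwise."""
    return 1.0 if rng.random() < SUCCESS_PROB[(task, expert)] else FAIL_PENALTY


def run_episode(tasks, q, rng, epsilon, alpha=0.05, gamma=0.9, max_steps=10):
    """Work through the sub-tasks in order, retrying a sub-task until the critic accepts."""
    i, total = 0, 0.0
    for _ in range(max_steps):
        if i == len(tasks):
            break
        task = tasks[i]
        # Epsilon-greedy choice of which expert to call for the current sub-task.
        if rng.random() < epsilon:
            expert = rng.choice(EXPERTS)
        else:
            expert = max(EXPERTS, key=lambda e: q[(task, e)])
        reward = critic_reward(task, expert, rng)
        next_i = i + 1 if reward > 0 else i  # stay on the same sub-task after a failure
        if next_i < len(tasks):
            next_best = max(q[(tasks[next_i], e)] for e in EXPERTS)
        else:
            next_best = 0.0  # terminal: every sub-task is done
        # One-step Q-learning update.
        q[(task, expert)] += alpha * (reward + gamma * next_best - q[(task, expert)])
        i, total = next_i, total + reward
    return total


def train(tasks, episodes=2000, seed=0):
    """Learn Q-values with an exploration rate that decays linearly to 0.05."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for ep in range(episodes):
        epsilon = max(0.05, 1.0 - ep / (episodes * 0.5))
        run_episode(tasks, q, rng, epsilon)
    return q


def greedy_plan(tasks, q):
    """Return the highest-valued expert for each sub-task."""
    return [max(EXPERTS, key=lambda e: q[(t, e)]) for t in tasks]


if __name__ == "__main__":
    tasks = ["generate_scene", "add_object"]
    print(greedy_plan(tasks, train(tasks)))
```

In this toy setting the learned policy routes each sub-task to the expert the simulated critic accepts most often, and the failure penalty discourages wasted retries. A full system along the lines the abstract describes would replace the lookup table with a learned Q-network over richer state and the stub with real critic feedback.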

Paper Structure

This paper contains 31 sections, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Select examples for complex long-form compositional prompts. Top-Left: Text-to-Image (T2I) generations from multiple baselines versus Image-POSER, which successfully integrates all compositional constraints (object counts, spatial relations, style fidelity). Top-Right: additional T2I scenes with fine-grained requirements. Bottom: Image-to-Image (I2I) edits where Image-POSER completes multi-step instructions (adding/removing/counting objects, altering viewpoint, and preserving layout) that single-shot models struggle with. Images cited in Appendix \ref{app:datasets}.
  • Figure 2: High-level example flow of Image-POSER's pipeline for image generation and editing, numbered step by step. Illustrates the RL loop from the environment, to the DQN agent selecting a visual expert, to the VLM outputting a reward and reflecting for future tasks.
  • Figure 3: Qualitative comparison of long-form prompts for generation (top) and editing (bottom). Baselines often fail on compositional constraints such as object counts, spatial relations, and object addition/removal. Image-POSER produces accurate, context-aware outputs that align with the instructions.
  • Figure 4: DQN Training Metrics. The agent's learning progress over 1000 training steps. The left plot shows the DQN loss, which converges steadily. The right plot shows the cumulative average reward, which increases and plateaus, indicating the agent has learned an effective policy.
  • Figure 5: Average reward scores assigned by the VLM critic to each expert during training.
  • ...and 7 more figures