AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models

Arman Zarei, Jiacheng Pan, Matthew Gwilliam, Soheil Feizi, Zhenheng Yang

TL;DR

AgentComp addresses the lack of explicit supervision for composition in text-to-image diffusion by autonomously constructing a contrastive, composition-focused dataset with an agentic LLM orchestrator. It then trains models via Agent Preference Optimization, a distance-aware objective that emphasizes faithful composition over visually similar but incorrect trajectories. Across multiple base models, AgentComp achieves state-of-the-art performance on compositional benchmarks while preserving image quality and even improving text rendering, demonstrating strong generalization. The approach highlights the value of agentic data generation and distance-aware fine-tuning for robust, compositionally grounded text-to-image synthesis.

Abstract

Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality: accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench without compromising image quality (a common drawback in prior approaches), and even generalizes to other capabilities not explicitly trained for, such as text rendering.

Paper Structure

This paper contains 31 sections, 18 equations, 20 figures, 4 tables.

Figures (20)

  • Figure 1: AgentComp significantly enhances the compositional abilities of text-to-image generative models, improving text–image alignment while preserving image quality and even boosting capabilities such as text rendering, despite not being explicitly trained for it.
  • Figure 2: Motivation for correcting compositional trajectories. During the denoising trajectory for a compositional prompt, the model is not explicitly trained to avoid visually similar paths that miss certain compositional details.
  • Figure 3: Illustration of the agentic orchestration. The orchestrator collaborates with specialized agents to generate a positive image, synthesize contrastive prompts, produce corresponding negative images, and rank them by compositional distance.
  • Figure 4: Example scenario of the Image Generation Agent. The agent employs iterative reasoning and tool calls to produce a compositionally accurate image that aligns with the given prompt.
  • Figure 5: Example from the dataset generated by the agentic orchestra. The dataset includes high-quality samples, with a reference image that accurately captures the compositional details in the given prompt, along with negative samples created by subtly altering those details in the reference text–image pair.
  • ...and 15 more figures
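
The Figure 3 caption describes one dataset sample as a positive image, a set of contrastive prompts with their negative images, and a ranking of those negatives by compositional distance. A minimal sketch of how such a record could be represented is given below. All class names, field names, file paths, and distance values are hypothetical and chosen only for illustration; the source does not specify the actual data format or how the distance is computed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class NegativeSample:
    """One contrastive sample (hypothetical structure, not the paper's format)."""
    prompt: str        # contrastive prompt that alters a compositional detail
    image_path: str    # placeholder path to the corresponding negative image
    distance: float    # assumed scalar "compositional distance" from the positive


@dataclass
class CompositionalRecord:
    """A positive text-image pair plus its negatives (hypothetical structure)."""
    prompt: str
    positive_image_path: str
    negatives: List[NegativeSample] = field(default_factory=list)

    def ranked_negatives(self) -> List[NegativeSample]:
        # Order negatives from closest to farthest from the positive sample.
        return sorted(self.negatives, key=lambda n: n.distance)

    def preference_pairs(self) -> Iterator[Tuple[str, str, float]]:
        # Yield (preferred image, rejected image, distance) triples that a
        # distance-aware preference objective could consume.
        for neg in self.ranked_negatives():
            yield self.positive_image_path, neg.image_path, neg.distance


# Illustrative usage with made-up prompts and distances.
record = CompositionalRecord(
    prompt="a red cube on top of a blue sphere",
    positive_image_path="positive.png",
    negatives=[
        NegativeSample("a blue cube on top of a red sphere", "neg_swap.png", 2.0),
        NegativeSample("a red cube next to a blue sphere", "neg_relation.png", 1.0),
    ],
)
pairs = list(record.preference_pairs())
```

Keeping the distance on each negative, rather than only a binary preferred/rejected label, is what would let a training objective weight pairs differently depending on how close the rejected image is to the intended composition.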