GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis

Ashish Goswami, Satyam Kumar Modi, Santhosh Rishi Deshineni, Harman Singh, Prathosh A. P, Parag Singla

TL;DR

GraPE proposes a Generate-Plan-Edit framework for compositional text-to-image synthesis that decomposes the handling of complex prompts into three stages: initial generation, object-centric planning via a Multi-Modal LLM, and sequential editing. The planner outputs a sequence of atomic edit steps, which an editing model executes to progressively align the image with the prompt; PixEdit (a compositional editor) and Aurora serve as the editing backbones. Empirical results across 13 T2I models and three benchmarks (T2I-CompBench, ConceptMix, Flickr-Bench) show substantial gains, especially for weaker models, and indicate that planning quality and editing fidelity jointly drive performance. The work highlights modularity and training-free plug-and-play integration, and offers insight into planning accuracy versus editing reliability, including cost and runtime analyses. This approach broadens practical T2I applicability by enabling more reliable compositional image synthesis without extensive model fine-tuning.

Abstract

Text-to-image (T2I) generation has seen significant progress with diffusion models, enabling generation of photo-realistic images from text prompts. Despite this progress, existing methods still face challenges in following complex text prompts, especially those requiring compositional and multi-step reasoning. Given such complex instructions, SOTA models often make mistakes in faithfully modeling object attributes and the relationships among them. In this work, we present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps: (a) Generate: we first generate an image using existing diffusion models; (b) Plan: we use Multi-Modal LLMs (MLLMs) to identify the mistakes in the generated image, expressed in terms of individual objects and their properties, and produce the sequence of corrective steps required in the form of an edit-plan; (c) Edit: we use existing text-guided image editing models to sequentially execute our edit-plan over the generated image to obtain the desired image, which is faithful to the original instruction. Our approach derives its strength from the fact that it is modular in nature, is training-free, and can be applied over any combination of image generation and editing models. As an added contribution, we also develop a model capable of compositional editing, which further helps improve the overall accuracy of our proposed approach. Our method flexibly trades inference-time compute for performance on compositional text prompts. We perform extensive experimental evaluation across 3 benchmarks and 10 T2I models, including DALLE-3 and the latest, SD-3.5-Large. Our approach not only improves the performance of the SOTA models by up to 3 points, it also reduces the performance gap between weaker and stronger models. https://dair-iitd.github.io/GraPE/
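The three steps in the abstract can be read as a simple control loop. The following is a minimal sketch of that loop, not the authors' implementation: the function `generate_plan_edit`, its callable parameters, and the `max_edits` cap are all hypothetical names introduced here, and the toy stand-ins replace the diffusion model, MLLM planner, and image editor with dictionary operations so the flow can be run end to end.

```python
from typing import Callable, List, Optional, TypeVar

Image = TypeVar("Image")


def generate_plan_edit(
    prompt: str,
    generate: Callable[[str], Image],
    plan: Callable[[str, Image], List[str]],
    edit: Callable[[Image, str], Image],
    max_edits: Optional[int] = None,
) -> Image:
    """Hypothetical sketch of a generate -> plan -> edit loop."""
    # (a) Generate: produce an initial image from the prompt.
    image = generate(prompt)
    # (b) Plan: ask the planner for a list of corrective edit instructions.
    steps = plan(prompt, image)
    # Assumption: capping the number of edits is one way to trade
    # inference-time compute against alignment with the prompt.
    if max_edits is not None:
        steps = steps[:max_edits]
    # (c) Edit: apply the instructions one after another.
    for step in steps:
        image = edit(image, step)
    return image


# --- Toy stand-ins (assumptions for illustration only) ---------------------
# An "image" is a dict mapping object name -> attribute, and a prompt is a
# comma-separated list such as "red cube, blue ball".

def _parse(prompt: str) -> List[tuple]:
    return [tuple(part.split()) for part in prompt.split(", ")]


def toy_generate(prompt: str) -> dict:
    # Pretend the generator draws every object but gets no attribute right.
    return {obj: "unknown" for _, obj in _parse(prompt)}


def toy_plan(prompt: str, image: dict) -> List[str]:
    # One atomic instruction per object whose attribute mismatches the prompt.
    return [
        f"make the {obj} {attr}"
        for attr, obj in _parse(prompt)
        if image.get(obj) != attr
    ]


def toy_edit(image: dict, step: str) -> dict:
    # Instructions have the fixed form "make the <object> <attribute>".
    _, _, obj, attr = step.split()
    return {**image, obj: attr}
```

Because the three stages are passed in as plain callables, any generator, planner, or editor with matching inputs and outputs could be substituted, which mirrors the modular, training-free design the abstract describes.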

Paper Structure

This paper contains 46 sections, 9 figures, 11 tables, 1 algorithm.

Figures (9)

  • Figure 2: Proposed GraPE framework. A given text prompt is used to generate an initial image $I_g$ from a T2I model. $I_g$ is then fed, along with the text prompt, into an MLLM-based planner, which identifies the objects that are misaligned in the image and outputs a set of edit plans guided by few-shot prompting. The plans are executed as a series of edits over the initial image to produce the final image.
  • Figure 3: Experimental results showcasing the maximum gain in DSG score by GraPE with both AURORA and PixEdit as editing models. The figure presents both DSG and DSG (w/o dependency) scores. The percentage gain is measured over DSG scores. The absolute values are presented in Tables \ref{table:t2i_compbench_bench_supp} and \ref{tab:flickr_bench_detailed}.
  • Figure 4: (a) Trend of GPT-QA score with increasing number of editing steps. (b) Average edit steps per plan for the ConceptMix benchmark across models
  • Figure 5: Results illustrating failure cases of the image-editing model
  • Figure 6: System prompt used with GraPE's MLLM planner
  • ...and 4 more figures