Compositional Image Synthesis with Inference-Time Scaling

Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn

TL;DR

ReFocus addresses the persistent challenge of compositionality in text-to-image (T2I) synthesis with a training-free framework that combines explicit layout grounding with iterative self-refinement. An LLM first generates an explicit object layout $L=\{(l_i,s_i)\}_{i=1}^{k}$ with $s_i \in [0,1]^4$. The framework then performs layout-grounded generation, followed by an object-centric self-refinement loop guided by a VLM-based judge. The pipeline has three phases: LLM-based layout generation, layout-grounded initial generation, and iterative refinement with a hybrid scene- and object-level evaluation. It yields stronger prompt fidelity and perceptual quality on GenEval and HPS v2 while remaining training-free. The results indicate that explicit compositional grounding combined with inference-time scaling improves object counts, attributes, and spatial relationships in complex scenes, offering a practical path to more reliable T2I synthesis.
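To make the layout notation concrete, the following is a minimal sketch of how $L=\{(l_i,s_i)\}_{i=1}^{k}$ might be represented and sanity-checked before it is passed to a generator. The names `LayoutItem`, `is_valid_box`, and `validate_layout` are hypothetical, and both the reading of $l_i$ as a text label and the `(x_min, y_min, x_max, y_max)` ordering of $s_i$ are assumptions; the source only states that $s_i \in [0,1]^4$.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class LayoutItem:
    label: str  # l_i, assumed to be a short object description
    box: Box    # s_i in [0, 1]^4


def is_valid_box(box: Box) -> bool:
    """A box is valid if every coordinate is in [0, 1] and it has positive area."""
    x0, y0, x1, y1 = box
    in_range = all(0.0 <= v <= 1.0 for v in box)
    return in_range and x0 < x1 and y0 < y1


def validate_layout(layout: List[LayoutItem]) -> List[LayoutItem]:
    """Keep only items whose boxes are well formed (e.g. drop malformed LLM output)."""
    return [item for item in layout if is_valid_box(item.box)]


layout = [
    LayoutItem("red apple", (0.10, 0.55, 0.40, 0.90)),
    LayoutItem("blue cup", (0.60, 0.50, 0.85, 0.95)),
    LayoutItem("bad box", (0.70, 0.20, 0.30, 1.20)),  # inverted and out of range
]
valid = validate_layout(layout)
print([item.label for item in valid])
```

Dropping malformed boxes is only one possible policy; re-prompting the LLM would be another, and the source does not say which the authors use.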

Abstract

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively reranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts than recent text-to-image models. The code is available at https://github.com/gcl-inha/ReFocus.
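The control flow described in the abstract (sample several candidates, let a judge pick the best, refine, and repeat) can be sketched as below. This is an illustrative toy under stated assumptions, not the authors' implementation: `self_refine`, `toy_generate`, `toy_refine`, and `toy_judge` are hypothetical names, an "image" is modeled as a single quality number, and the stopping threshold and the keep-only-if-better rule are assumptions.

```python
import random
from typing import Callable, List, Tuple

Image = float  # toy stand-in: an "image" is just its quality in [0, 1]


def self_refine(
    prompt: str,
    generate: Callable[[str, random.Random], Image],
    refine: Callable[[str, Image, random.Random], Image],
    judge: Callable[[str, Image], float],
    n: int = 4,
    rounds: int = 3,
    threshold: float = 0.9,  # assumed stopping criterion
    seed: int = 0,
) -> Tuple[Image, List[float]]:
    """Best-of-N drafting followed by judge-guided refinement rounds."""
    rng = random.Random(seed)
    drafts = [generate(prompt, rng) for _ in range(n)]
    best = max(drafts, key=lambda img: judge(prompt, img))
    history = [judge(prompt, best)]  # judge score of the kept image per round
    for _ in range(rounds):
        if history[-1] >= threshold:
            break
        refined = [refine(prompt, best, rng) for _ in range(n)]
        challenger = max(refined, key=lambda img: judge(prompt, img))
        # Assumption: a refined candidate replaces the current best only if it scores higher.
        if judge(prompt, challenger) > history[-1]:
            best = challenger
        history.append(judge(prompt, best))
    return best, history


# Toy components standing in for the diffusion model, refiner, and VLM judge.
def toy_generate(prompt: str, rng: random.Random) -> Image:
    return 0.5 * rng.random()


def toy_refine(prompt: str, img: Image, rng: random.Random) -> Image:
    return min(1.0, img + 0.3 * rng.random())


def toy_judge(prompt: str, img: Image) -> float:
    return img


best, history = self_refine(
    "two red apples left of a blue cup", toy_generate, toy_refine, toy_judge
)
print(history)
```

Because a challenger is kept only when it scores higher, the recorded scores never decrease across rounds; in a real system the three callables would wrap the generator, the refinement model, and the VLM judge.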

Paper Structure

This paper contains 9 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of ReFocus. (1) LLM-based layout generation: the prompt is mapped to an explicit box layout $L$, which is lightly regularized (2% border margin, $\delta{=}0.02$) to avoid truncation. (2) Layout-grounded generation: a diffusion model conditioned on $L$ samples $N$ drafts. (3) Iterative self-refinement: a hybrid re-ranking module, weighted by $\lambda$, selects the best candidate; the refinement model then refines it and the new candidates are re-ranked, repeating until a prompt-consistent image is produced.
  • Figure 2: Average GenEval score as the number of samples per prompt ($N$ in Best-of-$N$) increases.
  • Figure 3: Visual comparison with prior text-to-image models [rombach2022ldm, sdxl], a layout-grounding method [gligen], and inference-time scaling approaches [ma2025inference, reflectdit].
  • Figure 4: Visual comparison of our proposed mechanism. Here, inference scaling refers to the naive Best-of-$N$ strategy.
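
Two details from the Figure 1 caption lend themselves to a short sketch: the border-margin regularization with $\delta = 0.02$ and the $\lambda$-weighted hybrid re-ranking. The code below is a hedged illustration only. The caption does not give the exact formulas, so clamping each coordinate to $[\delta, 1-\delta]$ and the convex combination $\lambda \cdot \text{scene} + (1-\lambda) \cdot \text{mean(object)}$ are assumptions, as are the names `clamp_to_margin`, `hybrid_score`, and `select_best`.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def clamp_to_margin(box: Box, delta: float = 0.02) -> Box:
    """Keep a normalized box at least `delta` away from every image border."""
    lo, hi = delta, 1.0 - delta
    x0, y0, x1, y1 = (min(max(v, lo), hi) for v in box)
    return (x0, y0, x1, y1)


def hybrid_score(scene: float, objects: Sequence[float], lam: float = 0.5) -> float:
    """Assumed form: lam * scene-level score + (1 - lam) * mean object-level score."""
    object_part = mean(objects) if objects else 0.0
    return lam * scene + (1.0 - lam) * object_part


def select_best(candidates: List[Tuple[float, Sequence[float]]], lam: float = 0.5) -> int:
    """Return the index of the candidate with the highest hybrid score."""
    return max(
        range(len(candidates)),
        key=lambda i: hybrid_score(candidates[i][0], candidates[i][1], lam),
    )


# A box touching the left and bottom borders is pulled inward by the margin.
print(clamp_to_margin((0.0, 0.10, 0.50, 1.0)))

# Each candidate: (scene-level score, per-object scores), all made-up toy values.
candidates = [
    (0.9, [0.5, 0.5]),  # looks right globally, weak on individual objects
    (0.7, [0.9, 0.8]),  # slightly worse globally, strong on individual objects
]
print(select_best(candidates, lam=0.5), select_best(candidates, lam=1.0))
```

With the toy scores above, an equal weighting favors the candidate whose individual objects are rendered well, while $\lambda = 1$ reduces to scene-level ranking alone, which is the trade-off a hybrid judge is meant to expose.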