RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

Liyao Jiang; Ruichen Chen; Chao Gao; Di Niu

TL;DR

RAISE attains state-of-the-art alignment while incurring fewer generated samples and VLM calls than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement.

Abstract

Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while incurring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.
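The requirement-driven loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Candidate`, `unmet`, `mutate`, `refine`, and `fix_prob` are hypothetical and do not come from the RAISE codebase, and the generator and verifier are replaced by stubs (a real system would call a T2I model and a VLM-based verifier). The sketch only shows the control flow: verify against a checklist, spend more samples only while items remain unsatisfied, and stop early otherwise.

```python
import random
from dataclasses import dataclass

# The three refinement action types named in the abstract.
ACTIONS = ("rewrite_prompt", "resample_noise", "edit_image")


@dataclass(frozen=True)
class Candidate:
    action: str           # refinement action that produced this candidate
    satisfied: frozenset  # checklist items the stub verifier counts as met


def unmet(candidate, checklist):
    """Stub verifier: checklist items the candidate does not satisfy."""
    return [item for item in checklist if item not in candidate.satisfied]


def mutate(parent, action, checklist, rng, fix_prob):
    """Stub refinement: with probability fix_prob, repair one unmet item."""
    missing = unmet(parent, checklist)
    if missing and rng.random() < fix_prob:
        return Candidate(action, parent.satisfied | {rng.choice(missing)})
    return Candidate(action, parent.satisfied)


def refine(checklist, max_rounds=8, fix_prob=0.5, seed=0):
    """Evolve candidates until the checklist is satisfied or the budget ends.

    Returns (best_candidate, rounds_used, samples_generated).
    """
    rng = random.Random(seed)
    best = Candidate("initial", frozenset())  # nothing verified as met yet
    samples = 1
    for round_idx in range(1, max_rounds + 1):
        if not unmet(best, checklist):
            return best, round_idx - 1, samples  # adaptive early stop
        children = [mutate(best, a, checklist, rng, fix_prob) for a in ACTIONS]
        samples += len(children)
        # Keep the parent in the pool so the best candidate never regresses.
        best = min([best, *children], key=lambda c: len(unmet(c, checklist)))
    return best, max_rounds, samples
```

With `fix_prob=1.0` the stub is deterministic: a one-item checklist stops after 1 round and 4 samples, while a three-item checklist needs 3 rounds and 10 samples, which is the sense in which compute tracks the number of unsatisfied requirements rather than a fixed budget.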

Paper Structure (21 sections, 12 equations, 9 figures, 7 tables)

Introduction
Related Work
Training-Free Inference-Time Scaling
Training-Based Inference-Time Scaling
Method
Requirement-Driven Adaptive Scaling
Multi-Action Mutational Refinement
Structured Tool-Grounded Verification
Experiments
Experimental Setup
Prompt-Image Alignment (GenEval)
Reasoning-Intensive Prompts (DrawBench)
Visual Comparisons
Ablation Studies
Conclusion
...and 6 more sections

Figures (9)

Figure 1: "McDonald’s Church," a challenging prompt. While other inference-time scaling methods struggle to improve over multiple rounds, RAISE refines T2I alignment using detailed requirement engineering and requirement-driven verification over multiple rounds.
Figure 2: Framework overview. RAISE applies diverse mutational refinement actions concurrently (prompt rewriting, noise resampling, and instructional editing) to evolve candidates in each round. It operates as a multi-agent system composed of an analyzer, a rewriter, and a verifier: 1) the Analyzer performs requirement analysis by extracting a structured, detailed checklist of prompt requirements from the user prompt and previous verification results; 2) the Rewriter refines T2I generation prompts or produces image-editing instructions to address unsatisfied requirements; 3) the Verifier evaluates generated candidates via structured tool-grounded verification.
Figure 3: Visual comparison on GenEval (Ghosh et al., 2023) and DrawBench (Saharia et al., 2022). RAISE improves prompt-image alignment on challenging prompts.
Figure 4: Pareto frontier and scaling performance. RAISE (red) achieves the highest GenEval (Ghosh et al., 2023) score with 41.9% fewer samples (18.6 vs. 32) and 88.6% fewer VLM calls (7.3 vs. 64). Unlike baselines that plateau or fail to improve with additional computation, RAISE maintains a strong performance–efficiency Pareto frontier and continues to improve as more samples are generated.
Figure 5: Visualization of the multi-round evolutionary refinement path in RAISE (prompt: "a photo of a bear above a clock"). This figure illustrates how RAISE explores the search space through evolutionary multi-action refinements such as prompt rewriting, resampling, and instructional editing to adaptively improve prompt–image alignment. Across successive rounds, the system evaluates requirement satisfaction and allocates additional refinements only where needed, demonstrating requirement-driven adaptive scaling.
...and 4 more figures
