Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

Ping Chen; Daoxuan Zhang; Xiangming Wang; Yungeng Liu; Haijin Zeng; Yongyong Chen

Abstract

Text-to-Image (T2I) generation has achieved great success, but precise generation remains hindered by the limited relational reasoning of static text encoders and by error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities along the Ordinary Differential Equation (ODE) trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search couples parallel rollout search with a flow steering mechanism: a Vision-Language Model (VLM) acts as a semantic critic that diagnoses intermediate latents and dynamically steers the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. We further provide two variants: AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that AFS-Search-Pro substantially improves on the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. AFS-Search-Fast also significantly improves performance while maintaining fast generation speed.
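The lookahead-and-select idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the paper's implementation: latents are two-element lists, `TARGET` stands in for a spatially correct result, `toy_critic` replaces the VLM reward with a negative squared distance, and the steering term in `corrective` is a generic pull toward the target rather than the paper's contrastive semantic energy.

```python
from typing import Callable, Dict, List, Tuple

State = List[float]
Velocity = Callable[[State, float], State]

TARGET: State = [1.0, -1.0]  # hypothetical "spatially correct" latent


def euler_rollout(x: State, velocity: Velocity, t0: float, t1: float, steps: int) -> State:
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with fixed-step Euler."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


def baseline(x: State, t: float) -> State:
    # Open-loop velocity: ignores the current state entirely.
    return [1.0, 0.0]


def exploration(x: State, t: float) -> State:
    # A perturbed alternative trajectory (arbitrary toy choice).
    return [1.0, 0.5]


def corrective(x: State, t: float) -> State:
    # Baseline velocity plus a feedback term pulling toward TARGET (toy steering).
    base = baseline(x, t)
    return [b + 2.0 * (g - xi) for b, g, xi in zip(base, TARGET, x)]


def toy_critic(x: State) -> float:
    # Stand-in for a VLM reward: higher is better, 0 is a perfect match.
    return -sum((xi - g) ** 2 for xi, g in zip(x, TARGET))


BRANCHES: Dict[str, Velocity] = {
    "baseline": baseline,
    "exploration": exploration,
    "corrective": corrective,
}


def select_best_branch(
    x_mid: State,
    branches: Dict[str, Velocity],
    critic: Callable[[State], float],
    t_mid: float,
    steps: int = 10,
) -> Tuple[str, State, Dict[str, float]]:
    """Roll every branch out from the shared intermediate state and keep the best."""
    finals = {name: euler_rollout(x_mid, vel, t_mid, 1.0, steps) for name, vel in branches.items()}
    scores = {name: critic(x) for name, x in finals.items()}
    best = max(scores, key=scores.get)
    return best, finals[best], scores
```

Calling `select_best_branch([0.0, 0.0], BRANCHES, toy_critic, t_mid=0.5)` rolls all three branches out from the same intermediate state. In this toy setting only the branch with feedback ends near the target, which is the intuition behind closing the loop.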

Paper Structure (38 sections, 9 equations, 13 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Compositional Text-to-Image Generation
Vision Language Models and Agentic Frameworks
Method
Overview
Preliminaries: Flow Matching and Open-Loop Limitation
Parallel Rollout Search (PRS)
Prompt Optimization.
Latent Space Search Tree.
Simulation and Selection.
Agentic Flow Steering (AFS)
Linear Trajectory Projection.
Contrastive Semantic Energy Formulation.
Time-Scaled Velocity Modulation.
...and 23 more sections

Figures (13)

Figure 1: Visual overview. AFS-Search provides a closed-loop generation paradigm that achieves precise, spatially grounded generation.
Figure 2: Motivation of our AFS-Search. Open-loop generation follows a fixed, feed-forward sampling trajectory without intermediate feedback or correction, while closed-loop generation introduces real-time visual feedback.
Figure 3: The framework of AFS-Search. The pipeline operates in four phases: (1) Prompt Optimization. A VLM rewrites the user prompt into an explicit instruction. (2) Generating Initial Structure. The FLUX.1-dev model generates an intermediate state up to a bifurcation point. (3) Parallel Rollout Search. A VLM Critic diagnoses the intermediate state to guide the search. The system explores three branches: a Baseline Branch, an Exploration Branch, and a Corrective Branch steered by AFS. (4) Output & Feedback. The optimal trajectory is selected based on VLM scores. If the scores fall below a threshold, a global redesign loop is triggered.
Figure 4: Motivation for Parallel Rollout Search. While the standard open-loop trajectory (Base Branch) yields a sub-optimal alignment (Score: 8.5), our search mechanism actively explores alternative futures. By comparing the Corrective Branch (guided by AFS) and Exploration Branch against the baseline, the agent identifies and selects the optimal trajectory (Score: 9.5) that best matches the prompt.
Figure 5: VLM's scoring mechanism. A VLM-driven evaluation framework (ranging from -10 to +10) that balances prompt adherence (50%), relational logic (30%), and visual integrity (20%). It employs a granular penalty-bonus system to enforce semantic precision and reward aesthetic quality. The full prompt is given in the supplementary material (Full Prompt Templates).
...and 8 more figures
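
The weighting in the Figure 5 caption and the selection rule in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: each criterion is taken to be scored on the same -10 to +10 scale and combined linearly, the penalty-bonus system is not modeled, and the function names and the threshold value are hypothetical. Only the 50/30/20 weights, the score range, and the 8.5 and 9.5 example scores come from the captions.

```python
from typing import Dict, Tuple

# Weights from the Figure 5 caption.
WEIGHTS: Dict[str, float] = {
    "prompt_adherence": 0.5,
    "relational_logic": 0.3,
    "visual_integrity": 0.2,
}


def combine_score(sub_scores: Dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, clamped to the [-10, 10] range."""
    total = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
    return max(-10.0, min(10.0, total))


def choose_branch(branch_scores: Dict[str, float], threshold: float) -> Tuple[str, bool]:
    """Return the best-scoring branch and whether a redesign loop should run.

    The redesign flag is set when even the best score is below the threshold
    (the threshold value itself is an assumption, not taken from the paper).
    """
    best = max(branch_scores, key=branch_scores.get)
    needs_redesign = branch_scores[best] < threshold
    return best, needs_redesign
```

For example, `choose_branch({"baseline": 8.5, "exploration": 7.0, "corrective": 9.5}, threshold=9.0)` selects the corrective branch without triggering a redesign; the 7.0 score and the 9.0 threshold are made up for the example.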
