Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

Jaihoon Kim, Taehoon Yoon, Jisung Hwang, Minhyuk Sung

TL;DR

This work introduces inference-time scaling for pretrained flow models by (i) converting the deterministic flow dynamics into an SDE to enable particle sampling, (ii) replacing the linear interpolant with a variance-preserving (VP) interpolant to broaden the search space and boost diversity, and (iii) proposing Rollover Budget Forcing (RBF) to adaptively allocate compute across timesteps. Together, SDE-based generation and the VP interpolant substantially improve reward alignment for flow models on compositional and quantity-aware image generation tasks, with RBF delivering the strongest gains and additional benefits when the reward is differentiable. The results indicate that stochastic generation and adaptive compute allocation can close the gap between flow and diffusion models for inference-time scaling, yielding high-quality, aligned outputs under a limited compute budget. The approach offers a practical way to improve the controllability of flow-based generators on complex prompts, while highlighting trade-offs in compute overhead and robustness to misuse.

Abstract

We propose an inference-time scaling approach for pretrained flow models. Recently, inference-time scaling has gained significant attention in LLMs and diffusion models, improving sample quality or better aligning outputs with user preferences by leveraging additional computation. For diffusion models, particle sampling has allowed more efficient scaling due to the stochasticity at intermediate denoising steps. In contrast, flow models have gained popularity as an alternative to diffusion models, offering faster generation and high-quality outputs in state-of-the-art image and video generative models, but the efficient inference-time scaling methods used for diffusion models cannot be directly applied to them because their generative process is deterministic. To enable efficient inference-time scaling for flow models, we propose three key ideas: 1) SDE-based generation, enabling particle sampling in flow models, 2) interpolant conversion, broadening the search space and enhancing sample diversity, and 3) Rollover Budget Forcing (RBF), an adaptive allocation of computational resources across timesteps to maximize budget utilization. Our experiments show that SDE-based generation, particularly variance-preserving (VP) interpolant-based generation, improves the performance of particle sampling methods for inference-time scaling in flow models. Additionally, we demonstrate that RBF with VP-SDE achieves the best performance, outperforming all previous inference-time scaling approaches.
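The adaptive allocation behind RBF can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a greedy search in which each timestep has a fixed quota of function evaluations, the search moves on as soon as a candidate beats the current reward, and any unused evaluations roll over to the next timestep. The names `rollover_budget_search`, `toy_step`, and `toy_reward` are hypothetical, and the random-walk step merely stands in for one stochastic (SDE) denoising step of a flow model.

```python
import random


def rollover_budget_search(step, reward, x0, num_steps, quota, seed=0):
    """Greedy particle search with a per-step quota that rolls over.

    step(x, t, rng) -> candidate next state (counts as one evaluation)
    reward(x)       -> float, higher is better
    Returns (final_state, final_reward, evaluations_used).
    """
    if quota < 1:
        raise ValueError("quota must be at least 1")
    rng = random.Random(seed)
    x, current_r = x0, reward(x0)
    carry = 0  # unused evaluations rolled over from earlier steps
    used = 0
    for t in range(num_steps):
        budget = quota + carry
        best_cand, best_r = None, float("-inf")
        spent = 0
        while spent < budget:
            cand = step(x, t, rng)
            spent += 1
            r = reward(cand)
            if r > best_r:
                best_cand, best_r = cand, r
            if r > current_r:
                break  # improvement found: advance early, save the rest
        carry = budget - spent
        used += spent
        # If nothing improved, fall back to the best candidate seen.
        x, current_r = best_cand, best_r
    return x, current_r, used


def toy_step(x, t, rng):
    # Assumption: a Gaussian random walk stands in for a stochastic step.
    return x + rng.gauss(0.0, 1.0)


def toy_reward(x):
    # Assumption: the reward prefers states close to 3.0.
    return -abs(x - 3.0)


if __name__ == "__main__":
    x, r, used = rollover_budget_search(
        toy_step, toy_reward, 0.0, num_steps=10, quota=4, seed=1
    )
    print(f"final state={x:.3f} reward={r:.3f} evaluations used={used}/40")
```

In this sketch the total number of evaluations never exceeds `num_steps * quota`; easy timesteps finish early and hand their leftover budget to later ones.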

Paper Structure

This paper contains 51 sections, 3 theorems, 45 equations, 23 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

(Theorem 1 of Uehara et al. [Uehara:2024Bridging].) The distribution induced by the optimal policy in Eq. (eq:soft_optimal_policy) is the target distribution in Eq. (eq:appendix_target_distribution).
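The two referenced equations are not reproduced on this page. For orientation only, the block below writes out the form such a statement commonly takes in KL-regularized reward alignment; it is an assumption about what the equations look like, not a transcription of them. The symbols $p^{\text{pre}}$ (pretrained model), $r$ (reward), $\alpha$ (regularization strength), $v_t$ (soft value function), and the convention that $x_0$ is the clean sample are all introduced here for illustration.

```latex
% Assumed target distribution: the pretrained model tilted by the reward
p^{*}(x_0) \;\propto\; p^{\text{pre}}(x_0)\,\exp\!\big(r(x_0)/\alpha\big)

% Assumed soft value function at an intermediate step t
v_t(x_t) \;=\; \alpha \log \mathbb{E}_{x_0 \sim p^{\text{pre}}(\cdot \mid x_t)}
  \Big[\exp\!\big(r(x_0)/\alpha\big)\Big]

% Assumed soft optimal policy: pretrained transition tilted by the soft value
p^{*}(x_{t-1} \mid x_t) \;\propto\;
  p^{\text{pre}}(x_{t-1} \mid x_t)\,\exp\!\big(v_{t-1}(x_{t-1})/\alpha\big)
```

Under these assumed definitions, the theorem says that sampling every transition from the tilted policy yields final samples distributed according to $p^{*}$.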

Figures (23)

  • Figure 1: Diverse applications of our inference-time scaling method. Pretrained flow models struggle to generate images that align with complex prompts (left side of each case), whereas our inference-time scaling effectively extends their capabilities to achieve precise alignment (red box).
  • Figure 2: Comparison of Linear-ODE, Linear-SDE, and VP-SDE. The visualization shows how trajectories evolve under different dynamics starting from the same noise latent.
  • Figure 2: Quantitative results of aesthetic image generation. † denotes the given reward used at inference time. The best result in each row is highlighted in bold.
  • Figure 3: Sample diversity test using FLUX [BlackForestLabs:2024Flux] under the linear and VP interpolants. All samples share the same initial latent. Prompt: "A steaming cup of coffee".
  • Figure 4: Interpolant log-SNR. Dashed lines show a reference SNR and timestep.
  • ...and 18 more figures

Theorems & Definitions (5)

  • Theorem 1
  • Proposition 1
  • proof
  • Corollary 1
  • proof