VFScale: Intrinsic Reasoning through Verifier-Free Test-time Scalable Diffusion Model

Tao Zhang, Jia-Shu Pan, Ruiqi Feng, Tailin Wu

TL;DR

VFScale tackles the challenge of scalable intrinsic reasoning with diffusion models by making the energy function learned via diffusion act as an internal verifier, removing the need for external scoring. It introduces Monotonic-Regression Negative Contrastive Learning (MRNCL) and KL regularization to shape a smooth, performance-correlated energy landscape, and a hybrid Monte Carlo Tree Search (hMCTS) to efficiently exploit increased inference budgets. Across Maze and Sudoku, VFScale dramatically improves test-time scalability, enabling large problem instances to be solved where naive diffusion fails, and achieving energy-guided verifiability that approaches a near-perfect dense oracle. This approach offers a practical path to robust, verifier-free reasoning in diffusion models and is applicable to broader diffusion-based reasoning tasks; code is released for reproducibility.

Abstract

Inspired by human System 2 thinking, LLMs excel at complex reasoning tasks via extended Chain-of-Thought. However, similar test-time scaling for diffusion models on complex reasoning remains largely unexplored. Existing work points to two primary challenges in this setting: (i) dependence on an external verifier, which marks a notable gap from the intrinsic reasoning of human intelligence that needs no external feedback, and (ii) the lack of an efficient search algorithm. In this paper, we introduce the Verifier-free Test-time Scalable Diffusion Model (VFScale) to achieve scalable intrinsic reasoning; it equips number-of-sample test-time scaling with the intrinsic energy function of diffusion models as the verifier. Concretely, VFScale comprises two key innovations that address these challenges. On the training side, VFScale combines a novel MRNCL loss with KL regularization to improve the energy landscape, so that the learned energy function itself serves as a reliable verifier. On the inference side, VFScale integrates the denoising process with a novel hybrid Monte Carlo Tree Search (hMCTS) to improve search efficiency. On the challenging reasoning tasks of Maze and Sudoku, we demonstrate the effectiveness of VFScale's training objective and scalable inference method. In particular, trained on Maze sizes of up to $6\times6$, VFScale solves 88% of Maze problems at the much larger size of $15\times15$, while standard diffusion models completely fail. The code can be found at https://github.com/AI4Science-WestlakeU/VFScale.
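The idea of number-of-sample scaling with an energy function as the verifier can be illustrated with a minimal best-of-N sketch. This is not the VFScale implementation: `best_of_n`, `toy_sample`, and `toy_energy` are hypothetical names, the sampler stands in for a full denoising run, and the toy energy is a hand-written function used in place of the learned one. The sketch also assumes the usual convention that lower energy indicates a better candidate.

```python
import random
from typing import Callable, List, Sequence, Tuple

Candidate = List[float]


def best_of_n(
    sample: Callable[[random.Random], Candidate],
    energy: Callable[[Sequence[float]], float],
    n: int,
    seed: int = 0,
) -> Tuple[Candidate, float]:
    """Draw n candidates and return the lowest-energy one with its energy."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    # The energy function is the only scorer; no external verifier is used.
    best = min(candidates, key=energy)
    return best, energy(best)


# Hypothetical toy stand-ins, for illustration only.
TARGET = [1.0, 0.0, 1.0]


def toy_sample(rng: random.Random) -> Candidate:
    """Stand-in for one denoising run: a noisy guess around TARGET."""
    return [t + rng.gauss(0.0, 0.5) for t in TARGET]


def toy_energy(x: Sequence[float]) -> float:
    """Stand-in for a learned energy: squared distance to TARGET."""
    return sum((a - b) ** 2 for a, b in zip(x, TARGET))


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        _, e = best_of_n(toy_sample, toy_energy, n)
        print(f"N={n:3d}  best energy={e:.4f}")
```

With a fixed seed the candidates for a smaller N are a prefix of those for a larger N, so the best energy can only stay the same or decrease as N grows. Whether lower energy also means a more correct solution depends on how well the energy tracks solution quality, which is the property the training-side losses described above are meant to improve.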

Paper Structure

This paper contains 45 sections, 14 equations, 14 figures, 23 tables, 1 algorithm.

Figures (14)

  • Figure 1: Visualizations of Maze training data and solutions generated by hMCTS denoising of our VFScale framework.
  • Figure 2: Overview of VFScale. This figure illustrates the key aspects of VFScale by contrasting its training and inference strategies with those of the previous method. (1) To qualify the intrinsic energy of diffusion models as a verifier, VFScale introduces ${\mathcal{L}}_\text{MRNCL}$ and ${\mathcal{L}}_\text{KL}$ to improve the energy landscape during training. (2) To improve search efficiency, VFScale proposes hybrid Monte Carlo Tree Search (hMCTS), which strikes a balance between best-of-$N$ and MCTS.
  • Figure 3: Scalability of different approaches on Maze and Sudoku.
  • Figure 4: Comparison of the L2 distances between the solutions obtained by different training methods and the ground truth at various denoising steps.
  • Figure 5: The model architecture for VFScale on the Sudoku task. The energy value is computed using the L2 norm of the final predicted output, similar to du2023reduce, while the output is used directly as the noise prediction for the diffusion baseline.
  • ...and 9 more figures
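The Figure 5 caption says the energy is the L2 norm of the network's final predicted output. A minimal sketch of that mapping, under stated assumptions: `l2_energy` and `toy_network` are hypothetical names, the network is a fixed linear map used in place of the real architecture, and the caption does not say whether the norm is squared or how the output is reduced across dimensions, so the plain Euclidean norm is used here.

```python
import math
from typing import Callable, List, Sequence


def l2_energy(
    network: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
) -> float:
    """Scalar energy: Euclidean (L2) norm of the network output for x."""
    output = network(x)
    return math.sqrt(sum(v * v for v in output))


def toy_network(x: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for the model: a fixed 2x2 linear map."""
    weights = [[1.0, 0.0], [0.0, 2.0]]
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


if __name__ == "__main__":
    print(l2_energy(toy_network, [3.0, 2.0]))  # output is [3.0, 4.0]
```

The same network output can be read in two ways, as the caption notes: reduced to a scalar energy as above, or used directly as the noise prediction in the diffusion baseline.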