$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

Peihao Wang; Ruisi Cai; Zhen Wang; Hongyuan Mei; Qiang Liu; Pan Li; Zhangyang Wang

TL;DR

This work proposes $\nabla$-Reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly, offering a cost-effective path to amplify LLM reasoning.

Abstract

Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities. However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-Reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Our core component, Differentiable Textual Optimization (DTO), leverages gradient signals from both the LLM's likelihood and a reward model to refine textual representations. $\nabla$-Reasoner further incorporates rejection sampling and an acceleration design to make decoding more robust and faster. Theoretically, we show that performing inference-time gradient descent in the sample space to maximize reward is dual to aligning an LLM policy via KL-regularized reinforcement learning. Empirically, $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark, while reducing the number of model calls by approximately 10-40% compared to strong baselines. Overall, our work introduces a paradigm shift from zeroth-order search to first-order optimization at test time, offering a cost-effective path to amplify LLM reasoning.
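
To make the idea of gradient-based refinement over token logits concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a per-position base distribution `base_logp` as a stand-in for the LLM's likelihood (a real LLM is autoregressive) and a linear table `reward_w` as a stand-in for a learned reward model. All names (`refine_logits`, `hard_reward`, `LAM`) are hypothetical. The logits are relaxed through a softmax, updated by gradient ascent on reward plus weighted likelihood, and the decoded result is kept only if it does not lower the reward, which loosely mirrors the rejection step mentioned in the abstract.

```python
import numpy as np

LAM = 0.1  # assumed weight on the likelihood term

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def objective(z, reward_w, base_logp, lam=LAM):
    """Relaxed objective: expected toy reward plus weighted base log-likelihood."""
    p = softmax(z)
    return float((p * (reward_w + lam * base_logp)).sum())

def refine_logits(z0, reward_w, base_logp, lam=LAM, lr=0.1, steps=200):
    """Gradient ascent on the relaxed objective with respect to token logits."""
    z = z0.copy()
    score = reward_w + lam * base_logp
    for _ in range(steps):
        p = softmax(z)
        expected = (p * score).sum(axis=-1, keepdims=True)
        grad = p * (score - expected)  # analytic gradient through the softmax
        z = z + lr * grad
    return z

def hard_reward(tokens, reward_w):
    """Toy reward of a discrete token sequence."""
    return float(reward_w[np.arange(len(tokens)), tokens].sum())

rng = np.random.default_rng(0)
T, V = 4, 6  # sequence length and vocabulary size (toy values)
base_logp = np.log(softmax(rng.normal(size=(T, V))))
reward_w = rng.normal(size=(T, V))

z0 = base_logp.copy()                 # start from the base model's logits
z1 = refine_logits(z0, reward_w, base_logp)

draft = z0.argmax(axis=-1)            # greedy decode before refinement
refined = z1.argmax(axis=-1)          # greedy decode after refinement

# Rejection step: keep the refined tokens only if the reward does not drop.
if hard_reward(refined, reward_w) >= hard_reward(draft, reward_w):
    accepted = refined
else:
    accepted = draft
```

In this toy setting the gradient is available in closed form. With a real LLM and reward model it would presumably come from automatic differentiation through both networks.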

Paper Structure (49 sections, 3 theorems, 18 equations, 4 figures, 5 tables, 4 algorithms)

Introduction
Preliminaries
Notations.
Language Models and Reward Models.
Reasoning as Decision Making.
Existing Approaches.
Reasoning with Gradient-Driven Decoding
Overview.
Differentiable Textual Optimization
Objective.
Parameterization.
Iterative Decoding with DTO
Policy Improvement via DTO.
Rejection Sampling.
Test-Time Scaling.
...and 34 more sections

Key Result

Theorem 4.1

Suppose $\{\rho^{t}\}_{t \ge 0}$ denotes the Wasserstein gradient flow minimizing the objective $\mathcal{L}_{PPO}$ in the distribution space with boundary conditions $\rho^{0} = \pi_{LLM}$ and $\rho^{\infty} = \rho^* = \operatorname*{arg\,min}_{\rho}\mathcal{L}_{PPO}(\rho)$. Then we can draw samples from $\rho^
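
For orientation, the exact form of $\mathcal{L}_{PPO}$ is not reproduced on this page. A common form of the KL-regularized objective and its well-known minimizer is sketched below, where the reward $r$ and the regularization strength $\beta > 0$ are assumed symbols that may differ from the paper's notation:

```latex
\mathcal{L}_{PPO}(\rho)
  = -\,\mathbb{E}_{y \sim \rho}\big[ r(y) \big]
    + \beta\, \mathrm{KL}\big( \rho \,\|\, \pi_{LLM} \big),
\qquad
\rho^*(y) \;\propto\; \pi_{LLM}(y)\, \exp\!\big( r(y) / \beta \big).
```

Under this assumed form, the minimizer reweights the base policy toward high-reward samples, so the log-density of $\rho^*$ combines the LLM's log-likelihood with a scaled reward. That is consistent with the abstract's statement that the method uses gradient signals from both the likelihood and a reward model.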

Figures (4)

Figure 1: LLM reasoning can be formulated as a maximization problem over the reward landscape. (left) Traditional inference-time scaling methods are zeroth-order methods, sampling numerous candidate responses and evaluating them to identify higher-quality solutions. (right) This paper introduces a first-order method for inference-time scaling, where reward gradients are directly leveraged to guide the search process toward highly rewarding responses.
Figure 2: $\nabla$-Reasoner: Decoding with DTO
Figure 3: A comparison of computational cost, measured by the number of model calls. Our method reduces costs by up to 40.2% compared to baselines.
Figure 4: Test-time scaling curves comparing our method with BoN and SC. We vary the number of samples $N$ for BoN and SC and the number of rollouts $N_{max}$ for our method. The results show that $\nabla$-Reasoner achieves superior performance at reduced cost across multiple models.
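
The two baselines named in the Figure 4 caption are zeroth-order in the sense of Figure 1: they only sample and score candidates. The sketch below assumes BoN stands for Best-of-N selection and SC for self-consistency (majority voting). The candidate list and the `toy_reward` table are hypothetical stand-ins for sampled LLM responses and a reward model.

```python
from collections import Counter

def best_of_n(candidates, reward_fn):
    """Best-of-N: return the candidate with the highest reward."""
    return max(candidates, key=reward_fn)

def self_consistency(final_answers):
    """Self-consistency: return the most frequent final answer."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical sampled responses: (reasoning text, final answer).
candidates = [
    ("2 + 2 * 3 = 12", "12"),
    ("2 + 2 * 3 = 2 + 6 = 8", "8"),
    ("2 * 3 = 6, then 6 + 2 = 8", "8"),
    ("2 + 2 = 4, 4 * 3 = 12, minus 3 is 9", "9"),
]

# Hypothetical reward scores, one per candidate, in place of a reward model.
toy_reward = {
    candidates[0]: 0.1,
    candidates[1]: 0.9,
    candidates[2]: 0.7,
    candidates[3]: 0.2,
}

bon_choice = best_of_n(candidates, toy_reward.get)
sc_choice = self_consistency([answer for _, answer in candidates])
```

Both selectors need all $N$ candidates to be generated in full before any of them can be preferred, which is why their cost grows with $N$ on the scaling curves.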

Theorems & Definitions (6)

Theorem 4.1
Proposition C.1
proof
Remark C.2
Theorem C.3: Restatement of Theorem 4.1
proof
