Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values

Hongbo Zhang, Han Cui, Guangsheng Bao, Linyi Yang, Jun Wang, Yue Zhang

TL;DR

Direct Value Optimization (DVO) tackles the brittleness of chain-of-thought reasoning in LLMs by using finely grained stepwise value signals rather than human-provided preferences. Framed as a stepwise MDP, DVO interprets the LLM as a soft $Q$-function and trains with a mean-squared-error objective against target values estimated via Monte Carlo Tree Search or an outcome-value predictor. Empirical results on math and commonsense reasoning show DVO consistently surpasses offline preference-based methods across multiple model sizes and benchmarks, including notable gains on GSM8K, MATH, and AGIEval-Math, as well as robust generalization to out-of-domain data. The work demonstrates that value signals provide more informative supervision for reasoning than pairwise preferences, yielding stronger, more stable improvements with fewer training steps.

Abstract

We introduce Direct Value Optimization (DVO), an innovative reinforcement learning framework for enhancing large language models in complex reasoning tasks. Unlike traditional methods relying on preference labels, DVO utilizes value signals at individual reasoning steps, optimizing models via a mean squared error loss. The key benefit of DVO lies in its fine-grained supervision, circumventing the need for labor-intensive human annotations. Target values within the DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques, even with fewer training steps. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a superior methodology under scenarios lacking explicit human preference information.
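The idea of regressing onto stepwise value targets can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps MCTS for plain Monte Carlo averaging over sampled rollouts, and the names `monte_carlo_step_values`, `mse_loss`, and the toy `rollouts` data are hypothetical. Each reasoning prefix receives a target equal to the mean outcome reward of the rollouts that pass through it, and a mean squared error compares predicted values with those targets.

```python
from statistics import mean


def monte_carlo_step_values(rollouts):
    """Estimate a target value for every reasoning prefix.

    `rollouts` is a list of (steps, outcome_reward) pairs. The value of a
    prefix is the mean outcome reward of all rollouts sharing that prefix.
    Assumption: this plain averaging stands in for MCTS value estimates.
    """
    rewards_by_prefix = {}
    for steps, reward in rollouts:
        for i in range(1, len(steps) + 1):
            prefix = tuple(steps[:i])
            rewards_by_prefix.setdefault(prefix, []).append(reward)
    return {prefix: mean(rs) for prefix, rs in rewards_by_prefix.items()}


def mse_loss(predicted, targets):
    """Mean squared error between predicted and target step values."""
    return mean((predicted[p] - targets[p]) ** 2 for p in targets)


# Toy data (hypothetical): three rollouts with binary outcome rewards.
rollouts = [
    (["a", "b"], 1.0),  # correct final answer
    (["a", "c"], 0.0),  # wrong final answer
    (["d"], 0.0),       # wrong final answer
]
targets = monte_carlo_step_values(rollouts)

# A predictor that outputs 0.5 for every prefix, used only to show the loss.
predicted = {prefix: 0.5 for prefix in targets}
loss = mse_loss(predicted, targets)
```

In this toy setup the shared first step `a` receives an intermediate target because only one of its two continuations succeeds, which is the kind of graded, per-step signal that a single pairwise preference label does not carry.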

Paper Structure

This paper contains 26 sections, 2 theorems, 22 equations, 8 figures, 1 table.

Key Result

Proposition 1

(Proof in the appendix.) In the general maximum-entropy reinforcement learning setting, a language model with policy $\pi_\theta$ can be viewed as an optimal soft Q-function under some reward.
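For orientation, the standard maximum-entropy identities behind statements of this kind can be written as follows. This is a sketch in assumed notation, not the paper's own derivation: $Q^*$ and $V^*$ denote the optimal soft Q-function and soft value function, $\beta$ is the entropy temperature, and $\ell_\theta(s, a)$ denotes the model's logit for action $a$ in state $s$.

$$V^*(s) = \beta \log \sum_{a} \exp\big(Q^*(s, a) / \beta\big), \qquad \pi^*(a \mid s) = \exp\big((Q^*(s, a) - V^*(s)) / \beta\big)$$

A softmax policy has exactly this form, since $\pi_\theta(a \mid s) = \exp\big(\ell_\theta(s, a) - \log \sum_{a'} \exp \ell_\theta(s, a')\big)$. Setting $Q(s, a) = \beta\, \ell_\theta(s, a)$ gives $V(s) = \beta \log \sum_{a'} \exp \ell_\theta(s, a')$ and recovers $\pi_\theta$ from the second identity. The proposition additionally asserts that some reward exists under which this $Q$ is optimal.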

Figures (8)

  • Figure 1: Overview of Direct Value Optimization. Compared with other self-improvement methods, DVO stands out by using MCTS to generate self-explored data and directly aligning the policy model with value estimations. This provides an efficient self-improvement framework that finely tunes the model to maximize total expected rewards.
  • Figure 2: Illustration of the step-by-step reasoning process. Each node represents a step with its corresponding reward and value estimate.
  • Figure 3: Different value estimation in DVO.
  • Figure 4: Ablation study on hyperparameters. The left figure demonstrates the effect of varying $\beta$ values during training, while the right figure highlights the impact of search iterations in MCTS.
  • Figure 5: Evolution of implicit rewards during DPO and DVO training. While the reward margin increases in both cases, the implicit reward of positive solutions decreases in DPO but increases in DVO.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof