VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers

Jianing Qi, Hao Tang, Zhigang Zhu

TL;DR

VerifierQ introduces offline Q-learning into verifier models to improve long-horizon reasoning at test time. By combining a modified Bellman update, Implicit Q-learning for large action spaces, and Conservative Q-learning to curb overestimation, it enables parallel Q-value estimation at the utterance level. Empirical results on GSM8K and MATH with TinyLlama-1.1B show superior performance over SFT-based verifiers and robustness across settings, highlighting the potential for RL-empowered verifiers to enhance robust reasoning and efficiency. The work paves the way for integrating actor-critic style architectures into LLMs, combining generation and evaluation for more capable AI systems.

Abstract

Recent advancements in test time compute, particularly through the use of verifier models, have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). This generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL). However, current verifier models in LLMs often rely on supervised fine-tuning without temporal difference learning such as Q-learning. This paper introduces VerifierQ, a novel approach that integrates offline Q-learning into LLM verifier models. We address three key challenges in applying Q-learning to LLMs: (1) handling utterance-level Markov Decision Processes (MDPs), (2) managing large action spaces, and (3) mitigating overestimation bias. VerifierQ introduces a modified Bellman update for bounded Q-values, incorporates Implicit Q-learning (IQL) for efficient action space management, and integrates a novel Conservative Q-learning (CQL) formulation for balanced Q-value estimation. Our method enables parallel Q-value computation, which improves training efficiency. While recent work has explored RL techniques like MCTS for generators, VerifierQ is among the first to investigate the verifier (critic) aspect in LLMs through Q-learning. This integration of RL principles into verifier models complements existing advancements in generator techniques, potentially enabling more robust and adaptive reasoning in LLMs. Experimental results on mathematical reasoning tasks demonstrate VerifierQ's superior performance compared to traditional supervised fine-tuning approaches, with improvements in efficiency, accuracy, and robustness. By enhancing the synergy between generation and evaluation capabilities, VerifierQ contributes to the ongoing evolution of AI systems in addressing complex cognitive tasks across various domains.
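
The abstract names Implicit Q-learning (IQL) as the tool for handling large action spaces. In its general form, IQL avoids an explicit maximum over actions by fitting an upper expectile of observed Q-values with an asymmetric squared loss. The sketch below illustrates that generic expectile loss only; it is not taken from the paper, and the names `expectile_loss`, `fit_expectile`, `tau`, `steps`, and `lr` are assumptions chosen for illustration.

```python
def expectile_loss(predictions, targets, tau):
    """Mean asymmetric squared error used in expectile regression.

    The residual u = target - prediction is weighted by tau when u >= 0
    and by (1 - tau) when u < 0.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for pred, target in zip(predictions, targets):
        u = target - pred
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(predictions)


def fit_expectile(samples, tau, steps=2000, lr=0.1):
    """Fit one scalar expectile of `samples` by gradient descent.

    Illustration only: a real verifier would fit a network output,
    not a single scalar.
    """
    v = sum(samples) / len(samples)  # start from the mean
    for _ in range(steps):
        grad = 0.0
        for q in samples:
            u = q - v
            weight = tau if u >= 0 else 1.0 - tau
            grad += -2.0 * weight * u  # derivative of weight * u^2 w.r.t. v
        v -= lr * grad / len(samples)
    return v
```

With `tau = 0.5` the fitted value is the sample mean; pushing `tau` toward 1 moves it toward the largest sample, which is how an expectile can stand in for a maximum over actions that appear in the data.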

Paper Structure

This paper contains 24 sections, 4 theorems, 23 equations, 6 figures, 1 algorithm.

Key Result

Theorem 1

Let $Q^*$ be the optimal Q-function. The modified Bellman update converges to a unique fixed point.
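
A claim of this kind typically rests on the update being a contraction, so repeated application reaches the same fixed point from any starting values. This page does not reproduce the modified update itself, so the sketch below assumes a hypothetical averaged backup, $Q(s,a) \leftarrow (1-w)\,r(s,a) + w \max_{a'} Q(s',a')$ with $0 < w < 1$ and rewards in $[0, 1]$, purely to illustrate bounded values and convergence. The toy MDP (`REWARDS`, `NEXT_STATE`) and all function names are invented for this example and are not the paper's.

```python
def bellman_backup(q, rewards, next_state, weight=0.5):
    """One synchronous sweep of an assumed averaged Bellman-style backup.

    q[s][a] <- (1 - weight) * r(s, a) + weight * max_a' q[s'][a']
    With rewards in [0, 1] and 0 < weight < 1, values that start in [0, 1]
    stay in [0, 1], and the map is a contraction in the max norm.
    """
    return [
        [
            (1.0 - weight) * rewards[s][a] + weight * max(q[next_state[s][a]])
            for a in range(len(q[s]))
        ]
        for s in range(len(q))
    ]


def iterate_to_fixed_point(q, rewards, next_state, tol=1e-10, max_sweeps=1000):
    """Apply the backup until the largest change in a sweep drops below tol."""
    for _ in range(max_sweeps):
        new_q = bellman_backup(q, rewards, next_state)
        gap = max(
            abs(new - old)
            for new_row, old_row in zip(new_q, q)
            for new, old in zip(new_row, old_row)
        )
        q = new_q
        if gap < tol:
            break
    return q


# Hypothetical deterministic MDP with 2 states and 2 actions.
REWARDS = [[1.0, 0.0], [0.0, 1.0]]      # r(s, a), each in [0, 1]
NEXT_STATE = [[1, 0], [0, 1]]           # s' reached from (s, a)

from_zeros = iterate_to_fixed_point([[0.0, 0.0], [0.0, 0.0]], REWARDS, NEXT_STATE)
from_ones = iterate_to_fixed_point([[1.0, 1.0], [1.0, 1.0]], REWARDS, NEXT_STATE)
```

Under these assumptions, both initializations settle on the same table, which is the behavior a unique fixed point implies.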

Figures (6)

  • Figure 1: Illustration of State, Action (green), and Reward (orange) in a Math Problem. $+$ denotes correct (1) and $-$ denotes incorrect (0). A state generator produces an action (next solution step). The verifier assesses the existing state and action and outputs a probability of correctness.
  • Figure 2: Illustration of the VerifierQ architecture and modified Bellman update. Left: Bellman update, where $Q_{\theta}$ is updated via the TD target with $V$. Right: Relationships among $Q_{\theta}$, $Q_{\hat{\theta}}$, and $V_{\psi}$. $V_{\psi}$ is updated through CQL, $Q_{\theta}$ through the Bellman equation, and $Q_{\hat{\theta}}$ via soft update.
  • Figure 3: Illustration of our approach. Left: Orange line represents the overestimated Q-value $Q_{\hat{\theta}}$. Blue line indicates the data distribution $Q_{\theta}$. Minimizing the overestimation term brings the orange line down to the mean of the data distribution. Right: Green line shows the lower expectile of the overestimated Q-value and purple line shows the upper expectile of the data Q-value. Minimizing those two terms makes the orange line approach the maximum Q-value under the data distribution.
  • Figure 4: Comparison of different methods on GSM8K (left) and MATH (right) using minimum evaluation. Rolling average over 20 steps. For VerifierQ we use $\tau_1 = 0.3$ (left) and $\tau=0.5$ (right).
  • Figure 5: Comparison of VerifierQ performance and Q-Learning methods
  • ...and 1 more figure
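
The Figure 2 caption says the target network $Q_{\hat{\theta}}$ is updated via soft update. A soft (Polyak) update, in its common form, moves each target parameter a small step toward the corresponding online parameter. The sketch below shows that generic rule on plain lists of floats; the function name `soft_update` and the mixing rate `alpha` are assumptions for illustration, and the paper's actual rate and parameter layout are not given on this page.

```python
def soft_update(target_params, online_params, alpha=0.01):
    """Return target parameters moved a fraction alpha toward the online ones.

    new_target = (1 - alpha) * target + alpha * online
    alpha is an assumed mixing rate with 0 < alpha <= 1.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [
        (1.0 - alpha) * t + alpha * o
        for t, o in zip(target_params, online_params)
    ]


# Usage: the target drifts toward fixed online parameters over many updates.
target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(500):
    target = soft_update(target, online, alpha=0.05)
```

A small `alpha` keeps the TD target slowly moving, which is the usual motivation for a separate target network.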

Theorems & Definitions (7)

  • Theorem 1: Convergence of Modified Bellman Update
  • Proof
  • Theorem 2: IQL Optimality
  • Proof: Proof Sketch
  • Lemma 3
  • Proposition 4: Modified CQL Bounds
  • Remark 5: Supporting Arguments and Intuitions