VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers

Jianing Qi; Hao Tang; Zhigang Zhu

TL;DR

VerifierQ introduces offline Q-learning into verifier models to improve long-horizon reasoning at test time. By combining a modified Bellman update, Implicit Q-learning for large action spaces, and Conservative Q-learning to curb overestimation, it enables parallel Q-value estimation at the utterance level. Empirical results on GSM8K and MATH with TinyLlama-1.1B show superior performance over SFT-based verifiers and robustness across settings, highlighting the potential for RL-empowered verifiers to enhance robust reasoning and efficiency. The work paves the way for integrating actor-critic style architectures into LLMs, combining generation and evaluation for more capable AI systems.

Abstract

Recent advancements in test time compute, particularly through the use of verifier models, have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). This generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL). However, current verifier models in LLMs often rely on supervised fine-tuning without temporal difference learning such as Q-learning. This paper introduces VerifierQ, a novel approach that integrates offline Q-learning into LLM verifier models. We address three key challenges in applying Q-learning to LLMs: (1) handling utterance-level Markov Decision Processes (MDPs), (2) managing large action spaces, and (3) mitigating overestimation bias. VerifierQ introduces a modified Bellman update for bounded Q-values, incorporates Implicit Q-learning (IQL) for efficient action space management, and integrates a novel Conservative Q-learning (CQL) formulation for balanced Q-value estimation. Our method enables parallel Q-value computation and improves training efficiency. While recent work has explored RL techniques like MCTS for generators, VerifierQ is among the first to investigate the verifier (critic) aspect in LLMs through Q-learning. This integration of RL principles into verifier models complements existing advancements in generator techniques, potentially enabling more robust and adaptive reasoning in LLMs. Experimental results on mathematical reasoning tasks demonstrate VerifierQ's superior performance compared to traditional supervised fine-tuning approaches, with improvements in efficiency, accuracy, and robustness. By enhancing the synergy between generation and evaluation capabilities, VerifierQ contributes to the ongoing evolution of AI systems in addressing complex cognitive tasks across various domains.
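To make the role of Implicit Q-learning more concrete, the following is a minimal sketch of the expectile regression idea that IQL is generally built on. It is not VerifierQ's training code: the function names, the toy Q-values, and the fixed-point solver are all assumptions chosen for illustration. The point it illustrates is that an upper expectile of sampled Q-values approximates a maximum over actions without enumerating the action space.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    `diff` stands for Q(s, a) - V(s). With tau > 0.5, cases where Q
    exceeds V are penalised more, which pulls V toward the upper range
    of the observed Q-values.
    """
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def expectile(q_values, tau=0.7, iters=100):
    """Scalar v minimising sum(expectile_loss(q - v, tau)) over q_values.

    Solved with a simple reweighted-mean fixed-point iteration
    (an illustrative choice, not the paper's procedure).
    """
    v = sum(q_values) / len(q_values)
    for _ in range(iters):
        weights = [tau if q >= v else 1.0 - tau for q in q_values]
        v = sum(w * q for w, q in zip(weights, q_values)) / sum(weights)
    return v


# Hypothetical Q-values for three candidate next steps from one state.
toy_q = [0.1, 0.4, 0.9]
v_mean = expectile(toy_q, tau=0.5)   # tau = 0.5 recovers the plain mean
v_upper = expectile(toy_q, tau=0.9)  # larger tau moves toward max(toy_q)
```

In this toy setting, raising `tau` moves the estimate from the mean of the candidates toward the best candidate, which is the sense in which an expectile can stand in for a max over a large action space.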
