PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan

TL;DR

This paper proposes Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG's dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
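
The verify-then-revise workflow described in the abstract can be pictured as a short inference-time loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `verify_then_revise`, `generate`, `verify`, and `max_turns` are hypothetical, and the two callables stand in for the same LLM prompted in its policy role and its verifier role. The RL training procedure itself is not shown.

```python
from typing import Callable, List, Tuple


def verify_then_revise(
    question: str,
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str, str], bool],
    max_turns: int = 2,
) -> Tuple[str, List[str]]:
    """Selective revision: revise only when verification flags an error.

    `generate` and `verify` are hypothetical stand-ins for one model
    acting as policy and as generative verifier, respectively.
    """
    attempts: List[str] = []
    answer = generate(question, attempts)  # policy role: first attempt
    attempts.append(answer)
    for _ in range(max_turns - 1):
        if verify(question, answer):  # verifier role: judged correct, stop
            break
        answer = generate(question, attempts)  # policy role: revised attempt
        attempts.append(answer)
    return answer, attempts


# Toy stand-ins (assumptions for illustration only). A real verifier has no
# access to the ground-truth answer; this one does, to keep the demo tiny.
def toy_generate(question: str, previous: List[str]) -> str:
    return "4" if previous else "5"  # wrong first, right on revision


def toy_verify(question: str, answer: str) -> bool:
    return answer == "4"


final, history = verify_then_revise("2 + 2 = ?", toy_generate, toy_verify)
print(final, history)
```

In contrast to an always-revise scheme, the loop exits as soon as the verifier accepts an answer, so an attempt that is judged correct is never overwritten by a second attempt.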

Paper Structure

This paper contains 20 sections, 64 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Top left: PAG achieves state-of-the-art self-correction performance across diverse mathematical reasoning datasets. Bottom left: SCoRe always generates a second attempt regardless of confidence. Right: Our PAG framework employs selective revision through self-verification, revising only when the initial attempt is explicitly identified as wrong.
  • Figure 2: Overview of the Policy as Generative Verifier (PAG) framework. The LLM alternates between a policy role (generating solutions) and a generative verifier role (evaluating its own solutions) in a multi-turn process. This iterative refinement continues until the self-verification step judges the answer correct or a maximum number of turns is reached.
  • Figure 3: Training dynamics of PAG on Qwen2.5-1.5B-Instruct. Left: answer change ratio, which quantifies model collapse as the proportion of responses whose second-turn answer differs from the first. Middle: Acc.@t1. Right: Acc.@final. The answer change ratio of Direct MultiTurn declines rapidly, indicating severe collapse. SCoRe partially alleviates this through two-stage RL, while PAG's selective revision mechanism effectively prevents collapse and achieves higher Acc.@t1 and Acc.@final.
  • Figure 4: Performance on RewardBench mathprm. Scores marked with * are taken from the RewardBench report.
  • Figure 5: PAG self-verify BoN outperforms majority voting. Left: PAG with 7B model on AIME2024; Right: PAG with 1.5B model on MATH500.
  • ...and 8 more figures
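
The quantities tracked in Figure 3 can be computed from per-question records of first-turn and final answers. The following is a minimal sketch, assuming exact string match against a reference answer. The function name `self_correction_metrics` and the sample data are hypothetical, and reading Acc.@t1 as first-turn accuracy and Acc.@final as final-answer accuracy is an interpretation of the labels rather than a definition quoted from the paper.

```python
from typing import Dict, List


def self_correction_metrics(
    first: List[str], final: List[str], gold: List[str]
) -> Dict[str, float]:
    """Compute answer change ratio, Acc.@t1, and Acc.@final.

    Assumes one first-turn answer, one final answer, and one reference
    answer per question, compared by exact string equality.
    """
    n = len(gold)
    if n == 0 or len(first) != n or len(final) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    # Share of questions whose final answer differs from the first attempt.
    changed = sum(a != b for a, b in zip(first, final))
    # Accuracy of the first attempt and of the final answer.
    correct_t1 = sum(a == g for a, g in zip(first, gold))
    correct_final = sum(b == g for b, g in zip(final, gold))
    return {
        "answer_change_ratio": changed / n,
        "acc_t1": correct_t1 / n,
        "acc_final": correct_final / n,
    }


# Made-up answers for four questions (illustration only, not paper data).
metrics = self_correction_metrics(
    first=["5", "7", "9", "1"],
    final=["4", "7", "9", "2"],
    gold=["4", "7", "8", "2"],
)
print(metrics)
```

An answer change ratio that falls toward zero during training means the second turn almost always repeats the first, which is the collapse symptom the caption attributes to Direct MultiTurn.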