ReVeal: Self-Evolving Code Agents via Reliable Self-Verification

Yiyang Jin, Kunzhao Xu, Hang Li, Xueting Han, Yanmin Zhou, Cheng Li, Jing Bai

TL;DR

ReVeal addresses unreliable self-verification in reinforcement-learning-based code reasoning by introducing a multi-turn generation-verification framework with tool-assisted evaluation. It employs TAPO, a turn-aware credit mechanism that decomposes rewards into outcome, generation, and verification components and assigns them across token- and turn-level units to stabilize learning and prevent gaming. Empirically, ReVeal achieves superior Pass@k and scales to 20+ inference turns on LiveCodeBench despite training with only 3 turns, demonstrating robust extrapolation and co-evolution of code and test generation. The results suggest that explicitly optimizing verification signals yields deeper exploration, stronger verification capabilities, and a scalable pathway to autonomous, self-improving AI agents across tasks with verifiable rewards.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models. However, existing methods rely solely on outcome rewards, without explicitly optimizing verification or leveraging reliable signals from realistic environments, leading to unreliable self-verification and limited test-time scaling. To address this, we widen the verification-generation asymmetry by explicitly optimizing self-verification, making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a multi-turn reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation. ReVeal structures long-horizon reasoning as iterative generation-verification turns and incorporates TAPO for turn-level credit assignment, fostering the co-evolution of code and test generation. At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three. It also significantly improves Pass@k, indicating stronger exploration that expands the reasoning boundaries of the base model. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
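The iterative generation-verification turns described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Turn`, `evolve`, `generate`, and `execute` are hypothetical, the model and the tool are passed in as plain callbacks, and stopping as soon as all self-constructed tests pass is an assumption made here for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    """One generation-verification turn (hypothetical record type)."""
    code: str            # candidate solution produced in this turn
    tests: List[str]     # self-constructed test cases for this turn
    feedback: str        # tool output from running the tests on the code
    all_passed: bool     # whether every self-constructed test passed


# Hypothetical callback signatures: the policy model and the execution tool.
Generate = Callable[[str, List[Turn]], Tuple[str, List[str]]]
Execute = Callable[[str, List[str]], Tuple[bool, str]]


def evolve(problem: str, generate: Generate, execute: Execute,
           max_turns: int = 3) -> List[Turn]:
    """Run up to max_turns generation-verification turns.

    Each turn conditions on the full history, so tool feedback from
    earlier turns can inform the next revision of code and tests.
    """
    history: List[Turn] = []
    for _ in range(max_turns):
        code, tests = generate(problem, history)   # generation step
        passed, feedback = execute(code, tests)    # tool-based verification
        history.append(Turn(code, tests, feedback, passed))
        if passed:  # assumed stop rule, not taken from the source
            break
    return history
```

Because `max_turns` is only a loop bound, the same function can be called with a larger value at inference than was used during training, which mirrors the 3-turn training versus 20+ turn inference setting the abstract reports.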

Paper Structure

This paper contains 42 sections, 9 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Performance of ReVeal on LiveCodeBench V6. (a) ReVeal enables effective test-time scaling, with Pass@1 accuracy improving from 34.8% at turn 1 to 38.7% at turn 25. (b) ReVeal (max_turn=10) consistently outperforms both the base model and the RL baseline in Pass@k, expanding the base model’s reasoning boundaries, which the RL baseline fails to achieve.
  • Figure 2: ReVeal expands the verification-generation (V-G) gap.
  • Figure 3: Illustration of ReVeal. (a) Iterative generation-verification loop with tool feedback. (b) TAPO with joint verifiable rewards: outcome, generation, and verification rewards.
  • Figure 4: Training curves of (a) first-turn code accuracy, (b) last-turn code accuracy, (c) last-turn verification accuracy on filtered correct test cases, and (d) last-turn test-case accuracy for three methods. Note on (b): the dip before step 40 reflects expanded evaluation coverage. As the format score reaches 0.9 around step 40, more problems enter the evaluation set, temporarily lowering accuracy.
  • Figure 5: Comparison of code accuracy, test case accuracy, and response length across training for ReVeal (Qwen2.5-32B-Instruct) with turn-level rewards, ReVeal with outcome-only rewards, and single-turn RL without tool integration.
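
For readers unfamiliar with the Pass@k metric reported in Figure 1, the sketch below computes the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. That ReVeal's evaluation uses exactly this estimator is an assumption; the summary above does not say how Pass@k was computed, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem.

    n: total samples drawn, c: samples that are correct, k: budget.
    Returns the probability that at least one of k samples, drawn
    without replacement from the n, is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is the mean of this value over all problems. With k = 1 the estimate reduces to c / n, the fraction of correct samples.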