RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning

Shaopeng Fu, Xingxing Zhang, Li Dong, Di Wang, Furu Wei

Abstract

While large language models (LLMs) have demonstrated strong performance on complex reasoning tasks such as competitive programming (CP), existing methods predominantly focus on single-attempt settings, overlooking their capacity for iterative refinement. In this paper, we present RefineRL, a novel approach designed to unleash the self-refinement capabilities of LLMs for CP problem solving. RefineRL introduces two key innovations: (1) Skeptical-Agent, an iterative self-refinement agent equipped with local execution tools to validate generated solutions against the public test cases of CP problems; the agent always maintains a skeptical attitude towards its own outputs and thereby enforces rigorous self-refinement even when validation suggests correctness. (2) A reinforcement learning (RL) recipe that incentivizes LLMs to self-refine using only standard RLVR (reinforcement learning with verifiable rewards) data, i.e., problems paired with their verifiable answers. Extensive experiments on Qwen3-4B and Qwen3-4B-2507 demonstrate that our method yields substantial gains: after our RL training, these compact 4B models, integrated with the Skeptical-Agent, not only outperform much larger 32B models but also approach the single-attempt performance of 235B models. These findings suggest that self-refinement holds considerable promise for scaling LLM reasoning, with significant potential for further advancement.
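
To make the mechanics concrete, here is a minimal sketch of such a skeptical refinement loop. It is an illustration under assumed interfaces, not the paper's implementation: the llm.generate call, the MAX_ROUNDS budget, the prompt wording, and the (input, expected-output) test format are all assumptions. The sketch shows the two behaviors described above: local execution against public test cases, and continued refinement even when every test passes.

```python
import os
import subprocess
import sys
import tempfile

MAX_ROUNDS = 4  # assumed refinement budget; the paper's setting is not shown here


def run_public_tests(code: str, tests: list[tuple[str, str]]) -> list[str]:
    """Run a candidate solution locally against public (stdin, expected-stdout)
    pairs. Returns one feedback string per failing test; empty means all passed."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    feedback = []
    try:
        for stdin, expected in tests:
            try:
                result = subprocess.run(
                    [sys.executable, path], input=stdin,
                    capture_output=True, text=True, timeout=5,
                )
            except subprocess.TimeoutExpired:
                feedback.append(f"input={stdin!r}: timed out")
                continue
            if result.stdout.strip() != expected.strip():
                feedback.append(
                    f"input={stdin!r}: expected {expected!r}, got {result.stdout!r}"
                )
    finally:
        os.unlink(path)
    return feedback


def skeptical_agent(llm, problem: str, public_tests: list[tuple[str, str]]) -> str:
    """Iteratively refine a solution, staying skeptical even when tests pass."""
    solution = llm.generate(problem)  # assumed LLM interface: prompt in, code out
    for _ in range(MAX_ROUNDS):
        failures = run_public_tests(solution, public_tests)
        if failures:
            prompt = problem + "\nYour solution failed:\n" + "\n".join(failures)
        else:
            # Skeptical step: public tests passed, but they rarely cover all
            # edge cases, so ask for a critical re-examination anyway.
            prompt = (problem + "\nAll public tests passed, but stay skeptical: "
                      "re-check edge cases, limits, and complexity, then revise if needed.")
        solution = llm.generate(prompt)
    return solution
```

The key design choice mirrored here is that a passing validation result never terminates refinement early; the agent treats public tests as necessary but not sufficient evidence of correctness.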

Paper Structure

This paper contains 19 sections, 9 equations, 7 figures, 7 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Overview of our RefineRL approach. (a) Skeptical-Agent employs local tools to evaluate generated solutions against visible public test cases and enforces rigorous self-refinement even when evaluation results suggest correctness. (b) Self-Refinement RL uses self-generated refinement trajectories produced by the Skeptical-Agent on real-world CP problems, together with a novel Squared-incentive reward function (see the sketch after this list), to advance the self-refinement capabilities of LLMs.
  • Figure 2: Training dynamics of Qwen3-4B. The left plot shows the steady increase in reward, while the right plot shows the corresponding growth in response length, reflecting the model's internalization of the skeptical reasoning process.
  • Figure 3: Training dynamics of Qwen3-4B-2507. Similar to the base model, the 2507 variant demonstrates consistent learning progress in terms of reward maximization and increased reasoning length during the RL process.
  • Figure 4: The standard system prompt used for initial solution generation and subsequent refinement steps.
  • Figure 5: Prompt for error-driven feedback construction.
  • ...and 2 more figures
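
The Squared-incentive reward function is only named in the Figure 1 caption; its exact form is not reproduced in this excerpt. The sketch below assumes one plausible reading, that the reward is the squared fraction of test cases passed; the function name and formulation are illustrative assumptions, not the paper's definition.

```python
def squared_incentive_reward(num_passed: int, num_total: int) -> float:
    """Hypothetical squared-incentive reward: the squared test pass rate.

    Squaring makes the reward superlinear in the pass rate, so fixing the
    last few failing tests is worth more than equal progress at low pass
    rates, pushing the policy toward fully correct solutions rather than
    partial credit. (Assumed form; not the paper's exact definition.)
    """
    if num_total == 0:
        return 0.0
    pass_rate = num_passed / num_total
    return pass_rate ** 2


# Example: the last two fixes are worth far more than the first two.
assert abs(squared_incentive_reward(10, 10) - squared_incentive_reward(8, 10) - 0.36) < 1e-9
assert abs(squared_incentive_reward(2, 10) - squared_incentive_reward(0, 10) - 0.04) < 1e-9
```

Under this assumed form, going from 8/10 to 10/10 passed tests adds 0.36 reward, while going from 0/10 to 2/10 adds only 0.04, concentrating the learning signal on reaching full correctness.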