
R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning

Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Na Li, Chuchu Fan

TL;DR

R1-Code-Interpreter presents a general-purpose approach to training LLMs to reason with code across 144 heterogeneous tasks via a two-stage process: supervised fine-tuning on 6.5k multi-turn trajectories, followed by reinforcement learning with Group Relative Policy Optimization (GRPO). A multi-stage curriculum guided by the improvement potential Pi_i of each sample mitigates sparse-reward and task-heterogeneity challenges, raising average RL gains from +3.4% to +9.3% and yielding the R1-CI-14B model, which reaches 72.4% average test accuracy and surpasses both text-only GPT-4o (58.6%) and GPT-4o with Code Interpreter (70.9%). The work demonstrates emergent self-checking through code execution and reports significant training-time reductions via a Code Execution Sandbox that decouples gradient computation from code runs. Overall, the paper provides a scalable, open-source path toward robust, multi-task Code Interpreter integration in LLMs, with implications for symbolic reasoning, programmatic problem-solving, and AI-assisted planning across domains.

Abstract

Practical guidance on training Large Language Models (LLMs) to leverage Code Interpreter across diverse tasks remains lacking. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. Unlike prior RL + tool-use efforts focused on narrow domains such as math or retrieval, we curate 144 diverse reasoning and planning tasks and show that training a general-purpose Code Interpreter across them presents significant challenges due to task heterogeneity and scarcity of effective samples. To address this, we introduce a multi-stage curriculum learning approach that partitions training samples by measured improvement potential. The RL training prioritizes samples with higher potential and gradually shifts to lower-potential ones, increasing the average RL gains from merely +3.4% to +9.3% across Qwen-2.5 models (3/7/14B). Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1% to 72.4%, outperforming text-only GPT-4o (58.6%) and GPT-4o with Code Interpreter (70.9%). Notably, R1-CI-14B also exhibits emergent self-checking behavior through code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
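
The curriculum idea in the abstract (train first on high-potential samples, then gradually admit lower-potential ones) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Sample` class, the score `4p(1-p)` used as a stand-in for improvement potential (with `p` the empirical success rate over repeated attempts), and the stage thresholds are all hypothetical choices made for this example. Stages are cumulative, so each later stage adds lower-potential samples to the pool.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical container: one training sample and its repeated-attempt results."""
    sample_id: str
    outcomes: Tuple[int, ...]  # 1 = correct attempt, 0 = wrong attempt


def improvement_potential(outcomes: Sequence[int]) -> float:
    """Assumed proxy for improvement potential: 4 * p * (1 - p), scaled to [0, 1].

    It is 0 when every attempt is correct or every attempt is wrong,
    and 1 when exactly half of the attempts are correct.
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    p = sum(outcomes) / len(outcomes)
    return 4.0 * p * (1.0 - p)


def build_curriculum(
    samples: Sequence[Sample],
    thresholds: Sequence[float] = (0.75, 0.5, 0.25),  # hypothetical cut-offs
) -> List[List[str]]:
    """Return one cumulative pool of sample ids per stage.

    Stage k keeps samples whose potential is at least thresholds[k];
    a final stage admits every sample, including zero-potential ones.
    """
    scored = [(s.sample_id, improvement_potential(s.outcomes)) for s in samples]
    stages = [[sid for sid, pot in scored if pot >= t] for t in thresholds]
    stages.append([sid for sid, _ in scored])
    return stages


if __name__ == "__main__":
    demo = [
        Sample("mixed", (1, 0, 1, 0)),
        Sample("mostly_right", (1, 1, 1, 0)),
        Sample("rarely_right", (1, 0, 0, 0, 0, 0, 0, 0)),
        Sample("always_right", (1, 1, 1, 1)),
    ]
    for i, pool in enumerate(build_curriculum(demo), start=1):
        print(f"stage {i}: {pool}")
```

In an actual training loop, each pool would be passed to the RL trainer in turn; how many steps to spend per stage is a further design choice not specified here.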

Paper Structure

This paper contains 27 sections, 5 theorems, 13 equations, 10 figures, 4 tables.

Key Result

Lemma C.1

If $r_1,\dots,r_G \overset{\mathrm{i.i.d.}}{\sim}\mathrm{Bernoulli}(p)$ and $\bar{r}=\tfrac1G\sum_{j=1}^G r_j$, then

Figures (10)

  • Figure 1: Training Code Interpreter-augmented reasoning models with multi-stage GRPO on 144 reasoning and planning tasks. (a) Our best model, R1-CI-14B, outperforms both GPT-4o (text-only) and GPT-4o with Code Interpreter. (b) Training reward and test scores improve steadily through the curriculum learning, then plateau at stage 4 after adding low-potential samples. (c) To assess sample effectiveness, we estimate improvement potential by repeatedly sampling answers with different agent frameworks and analyzing the correct/wrong distribution. GRPO begins with high-potential samples and gradually incorporates lower-potential ones.
  • Figure 2: Example response of R1-Code-Interpreter in Blocksworld task.
  • Figure 3: GRPO training without curriculum learning. (a) Training rewards increase slightly in the early steps, then plateau. (b) In the 14B setting, test scores across individual tasks (colored lines) show diverse trends, while the average score (bold black line) rises slightly before plateauing, mirroring (a). (c) Training curve on the single task Game24. (d) Average score improvement vs. number of tasks for GRPO training.
  • Figure 4: Multi-stage curriculum learning with the guidance of measured improvement potential for each sample.
  • Figure 5: Score distribution across 144 training and testing tasks for the four compared methods.
  • ...and 5 more figures

Theorems & Definitions (8)

  • Lemma C.1: Within-group Bernoulli variance
  • proof
  • Proposition 1: Policy term is controlled by $p(1-p)$
  • proof
  • Corollary C.1: Vanishing signal at the extremes
  • Lemma C.2: Concentration of the potential estimator
  • Proposition 2: Potential aligns with gradient strength
  • Remark C.1: Clipping and KL
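
The role of $p(1-p)$ in the results listed above can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the paper's proof. It assumes the commonly used GRPO group normalization (reward minus group mean, divided by group standard deviation), and it checks the textbook identity $\mathbb{E}\big[\tfrac1G\sum_{j=1}^G (r_j-\bar r)^2\big] = \tfrac{G-1}{G}\,p(1-p)$ for i.i.d. Bernoulli($p$) rewards, which is assumed rather than confirmed to be the quantity in Lemma C.1. The function names and the `eps` stabilizer are hypothetical.

```python
from math import comb, sqrt
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r_j - mean) / (std + eps).

    Assumes the standard GRPO normalization; eps is a hypothetical
    stabilizer that avoids division by zero when all rewards are equal.
    """
    g = len(rewards)
    if g == 0:
        raise ValueError("need at least one reward")
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g  # population variance
    std = sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def expected_within_group_variance(p: float, g: int) -> float:
    """Exact E[(1/G) * sum_j (r_j - r_bar)^2] for i.i.d. Bernoulli(p) rewards.

    With k successes out of g, the within-group variance is (k/g) * (1 - k/g);
    the expectation sums this over the Binomial(g, p) distribution of k.
    """
    total = 0.0
    for k in range(g + 1):
        prob = comb(g, k) * p**k * (1.0 - p) ** (g - k)
        frac = k / g
        total += prob * frac * (1.0 - frac)
    return total


if __name__ == "__main__":
    g = 8
    for p in (0.0, 0.1, 0.5, 0.9, 1.0):
        exact = expected_within_group_variance(p, g)
        closed_form = (g - 1) / g * p * (1.0 - p)
        print(f"p={p:.1f}  enumerated={exact:.6f}  closed_form={closed_form:.6f}")
    print(grpo_advantages([1, 1, 1, 1]))  # identical rewards: no learning signal
    print(grpo_advantages([1, 0, 0, 0]))  # mixed rewards: non-zero advantages
```

When every rollout in a group succeeds or every rollout fails, all advantages are zero, which is consistent with the vanishing signal named in Corollary C.1 and with prioritizing samples whose success rate is away from 0 and 1.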