Towards Effective Code-Integrated Reasoning

Fei Bai, Yingqian Min, Beichen Zhang, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen

TL;DR

This work tackles code-integrated reasoning by enabling LLMs to generate and execute code via a code interpreter within a tool-augmented RL framework. It identifies instability sources in tool-based training and proposes a dual strategy of exploration enhancement and stability maintenance, including budget scheduling and precise interaction boundaries. Empirical results across five math benchmarks show CIR achieving state-of-the-art performance and reveal mechanistic insights: code integration can extend reasoning capacity and produce concise, efficient solution paths, with non-executable code still contributing to learning. The findings underscore the practical potential of external code execution to boost mathematical reasoning and illuminate directions for generalizing code tooling to other domains.

Abstract

In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of the model's capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: https://github.com/RUCAIBox/CIR.

Paper Structure

This paper contains 19 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of interaction boundary based on stop tokens and precise matching. (a) The model first generates reasoning steps and then generates code. Interaction with the code interpreter is only triggered once a designated stop token, such as ```output, is emitted. Upon detecting this token, the preceding code segment is extracted, executed by the code interpreter, and the resulting output is appended to the model’s response, continuing the reasoning process. If the model generates code but fails to immediately emit the stop token, it may introduce noise or even miss a necessary interaction. (b) Execution is triggered when the model detects a complete and well-formed code block (e.g., ```python ... ```). At this point, the model pauses its reasoning and interacts with the code interpreter. This exact-match criterion helps ensure that only valid code blocks are executed, preventing noise or omissions from malformed outputs.
  • Figure 2: Training reward and average response length during the training process.
  • Figure 3: The test accuracy on AIME2024 and MATH500.
  • Figure 4: The average code generation number, code generation ratio and code pass rate during the training process.
  • Figure 5: The prompt triggering the model to utilize code-integrated reasoning.
  • ...and 2 more figures
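
The precise-matching boundary described in the Figure 1 caption, case (b), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`extract_trailing_code`, `run_python`, `maybe_interact`), the subprocess-based executor, and the exact layout of the appended output block are assumptions made for illustration only. The idea shown is that execution is triggered only when the generated text ends with a complete, well-formed Python code block.

```python
import subprocess
import sys
from typing import Callable, Optional

FENCE = "`" * 3              # the triple-backtick fence marker
OPEN_TAG = FENCE + "python\n"


def extract_trailing_code(text: str) -> Optional[str]:
    """Return the code of a complete python block that ends `text`, else None.

    Hypothetical helper: a block counts as complete only if the text ends with
    a closing fence and the nearest preceding fence opens a python block.
    """
    stripped = text.rstrip()
    if not stripped.endswith(FENCE):
        return None
    body = stripped[: -len(FENCE)]
    start = body.rfind(FENCE)
    if start == -1 or not body.startswith(OPEN_TAG, start):
        return None
    return body[start + len(OPEN_TAG):]


def run_python(code: str, timeout: float = 5.0) -> str:
    """Assumed executor: run the code in a fresh interpreter process."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "TimeoutError"
    # Report stdout on success, stderr otherwise, so errors are fed back too.
    return proc.stdout if proc.returncode == 0 else proc.stderr


def maybe_interact(text: str, executor: Callable[[str], str] = run_python) -> str:
    """Append an output block only when a well-formed code block was detected."""
    code = extract_trailing_code(text)
    if code is None:
        return text  # no complete block: keep generating, no interaction
    result = executor(code).rstrip("\n")
    return f"{text.rstrip()}\n{FENCE}output\n{result}\n{FENCE}\n"
```

In a rollout loop, `maybe_interact` would be called whenever generation pauses; an unterminated or malformed block leaves the text unchanged, which mirrors the caption's point that only valid code blocks should trigger the interpreter.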