To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization
Haozhe Wang, Long Li, Chao Qu, Fengming Zhu, Weidi Xu, Wei Chu, Fangzhen Lin
TL;DR
This work addresses the rigidity of existing tool-integrated math language models by enabling autonomous, metacognitive code integration during reasoning. It introduces an Expectation-Maximization framework (AutoCode) that treats code-triggering as a latent variable and alternates guided exploration of code-using trajectories with off-policy RL optimization. Empirical results show substantial improvements on challenging benchmarks (e.g., MATH500 and AIME) and enhanced training efficiency, driven by structured data curation and learned code-use strategies with high selection accuracy. The results indicate that combining chain-of-thought reasoning with code execution yields synergistic gains, signaling a practical path toward more adaptive and capable math LLMs.
Abstract
Recent advances in mathematical problem-solving with language models (LMs) integrate chain-of-thought (CoT) reasoning and code execution to harness their complementary strengths. However, existing hybrid frameworks exhibit a critical limitation: they depend on externally dictated instructions or rigid code-integration templates and lack metacognitive awareness, that is, the capacity to dynamically evaluate intrinsic capabilities and autonomously determine when and how to integrate tools. This rigidity motivates our study of autonomous code integration, which enables models to adapt tool-usage strategies as their reasoning abilities evolve during training. While reinforcement learning (RL) shows promise for boosting LM reasoning at scale (e.g., DeepSeek-R1), we demonstrate that it is inefficient at learning autonomous code integration because it inadequately explores the vast combinatorial space of CoT-code interleaving patterns. To address this challenge, we propose a novel Expectation-Maximization (EM) framework that synergizes structured exploration (E-step) with off-policy RL optimization (M-step), creating a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities. Experiments show that our method achieves superior results through improved exploration. Notably, our 7B model improves by over 11% on MATH500 and 9.4% on AIME without o1-like CoT.
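To make the latent-variable view concrete, the following is a minimal toy sketch, not the method from the paper. It assumes a single query type, a binary latent decision ("code" or "cot"), and hypothetical per-decision success probabilities. The E-step explores both decisions equally and forms a posterior over the decision given a correct answer; the M-step here is a closed-form update of the policy to that posterior, which stands in for the off-policy RL optimization described above. The function name `em_code_selection` and all parameters are illustrative assumptions.

```python
import random


def em_code_selection(success_prob, n_rounds=20, n_samples=200, seed=0):
    """Toy EM loop over a binary latent decision for one query type.

    success_prob: dict mapping "code" / "cot" to a hypothetical probability
    that a trajectory using that decision reaches a correct answer.
    Returns the learned probability of choosing "code".
    """
    rng = random.Random(seed)
    p_code = 0.5  # initial policy: undecided between code and CoT

    for _ in range(n_rounds):
        # E-step (guided exploration): roll out BOTH decisions equally often,
        # regardless of the current policy, and count correct outcomes.
        wins = {"code": 0, "cot": 0}
        for decision in ("code", "cot"):
            for _ in range(n_samples):
                if rng.random() < success_prob[decision]:
                    wins[decision] += 1

        # Posterior over the latent decision given a correct answer:
        # prior (current policy) times estimated success rate.
        w_code = p_code * wins["code"] / n_samples
        w_cot = (1.0 - p_code) * wins["cot"] / n_samples
        total = w_code + w_cot
        if total == 0:
            continue  # no correct rollouts this round; keep the policy

        # M-step (closed form in this toy setting): move the policy to the
        # posterior. The paper uses off-policy RL here instead.
        p_code = w_code / total

    return p_code


if __name__ == "__main__":
    # Hypothetical query type where code execution helps.
    print(em_code_selection({"code": 0.9, "cot": 0.3}))
    # Hypothetical query type where plain CoT is more reliable.
    print(em_code_selection({"code": 0.2, "cot": 0.8}))
```

In this sketch the policy drifts toward whichever decision succeeds more often, which illustrates how a tool-use preference can track the model's own capabilities. Exploring both decisions in the E-step, rather than sampling only from the current policy, is what keeps a rarely chosen but useful decision from being ignored.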
