Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function Calling

Bingguang Hao; Zengzhuang Xu; Maolin Wang; Yuntao Wen; Yicheng Chen; Cunyin Peng; Long Chen; Dong Wang; Xiangyu Zhao; Jinjie Gu; Chenyi Zhuang; Ji Zhang

TL;DR

This work tackles the challenge of training LLMs for robust function calling by balancing exploration of complex reasoning paths with stable policy optimization. It introduces EGPO, which integrates Chain-of-Thought entropy into Group Relative Policy Optimization, coupled with a clipping mechanism and a binary reward to guide learning. Empirical results on BFCL and related benchmarks show that EGPO, even at 4B parameters, achieves state-of-the-art performance among comparable-size models and outperforms larger proprietary systems on multi-turn tool-use tasks. The findings suggest that entropy-driven CoT exploration yields more structured, reliable tool invocation patterns, including parameter extraction and verification, with practical implications for real-world LLM deployments. Overall, EGPO advances efficient and robust tool-using capabilities in open-source LLMs, narrowing the gap to large-scale models.

Abstract

The effective training of Large Language Models (LLMs) for function calling faces a critical challenge: balancing exploration of complex reasoning paths with stable policy optimization. Standard methods like Supervised Fine-Tuning (SFT) fail to instill robust reasoning, and traditional Reinforcement Learning (RL) struggles with inefficient exploration. We propose EGPO, a new RL framework built upon Group Relative Policy Optimization (GRPO), designed to address this challenge directly. The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model's Chain-of-Thought (CoT) into the policy gradient computation. This encourages the generation of diverse reasoning strategies. To maintain optimization direction, the entropy bonus is carefully constrained by a clipping mechanism. Complemented by a strict, binary reward signal, EGPO effectively guides the model towards discovering structured and accurate tool invocation patterns. On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B-parameter model trained with EGPO sets a new state-of-the-art among models of comparable size, surpassing a range of strong competitors, including GPT-4o and Gemini-2.5.
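The abstract names three ingredients: a group-relative advantage, a CoT entropy bonus held in check by clipping, and a strict binary reward. The sketch below shows one way these could fit together. It is an illustration under stated assumptions, not the paper's formulation: the abstract gives no formula, so the additive form of the bonus, the clipping rule (capping the bonus at a fraction of the advantage magnitude so it cannot flip the sign), and the names `alpha`, `clip_ratio`, and `entropy_enhanced_advantages` are all hypothetical.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def binary_reward(predicted_call, reference_call):
    """Strict reward: 1.0 only when the predicted call matches exactly, else 0.0."""
    return 1.0 if predicted_call == reference_call else 0.0


def entropy_enhanced_advantages(rewards, cot_entropies,
                                alpha=0.1, clip_ratio=0.5, eps=1e-6):
    """Group-relative advantages plus a clipped CoT entropy bonus.

    rewards:       one reward per sampled response in the group.
    cot_entropies: mean token entropy over each response's CoT tokens.
    alpha, clip_ratio: hypothetical hyperparameters (not from the paper).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = []
    for reward, entropy in zip(rewards, cot_entropies):
        # GRPO-style normalization of the reward within the sampled group.
        base = (reward - mu) / (sigma + eps)
        # Assumed clipping rule: with clip_ratio < 1 the bonus is at most a
        # fraction of |base|, so it can never reverse the advantage's sign.
        bonus = min(alpha * entropy, clip_ratio * abs(base))
        advantages.append(base + bonus)
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]
    cot_entropies = [2.0, 0.5, 3.0, 0.0]
    print(entropy_enhanced_advantages(rewards, cot_entropies))
```

Under this assumed rule, two correct responses with different CoT entropy receive different advantages, which favors the more exploratory reasoning trace, while an incorrect response keeps a negative advantage no matter how high its entropy is. When every response in a group earns the same reward, the base advantage is zero and so is the bonus.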
