
Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function Calling

Bingguang Hao, Zengzhuang Xu, Maolin Wang, Yuntao Wen, Yicheng Chen, Cunyin Peng, Long Chen, Dong Wang, Xiangyu Zhao, Jinjie Gu, Chenyi Zhuang, Ji Zhang

TL;DR

This work tackles the challenge of training LLMs for robust function calling by balancing exploration of complex reasoning paths with stable policy optimization. It introduces EGPO, which integrates Chain-of-Thought entropy into Group Relative Policy Optimization, coupled with a clipping mechanism and a binary reward to guide learning. Empirical results on BFCL and related benchmarks show that EGPO, even at 4B parameters, achieves state-of-the-art performance among comparable-size models and outperforms larger proprietary systems on multi-turn tool-use tasks. The findings suggest that entropy-driven CoT exploration yields more structured, reliable tool invocation patterns, including parameter extraction and verification, with practical implications for real-world LLM deployments. Overall, EGPO advances efficient and robust tool-using capabilities in open-source LLMs, narrowing the gap to large-scale models.

Abstract

The effective training of Large Language Models (LLMs) for function calling faces a critical challenge: balancing exploration of complex reasoning paths with stable policy optimization. Standard methods like Supervised Fine-Tuning (SFT) fail to instill robust reasoning, and traditional Reinforcement Learning (RL) struggles with inefficient exploration. We propose EGPO, a new RL framework built upon Group Relative Policy Optimization (GRPO), designed to address this challenge directly. The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model's Chain-of-Thought (CoT) into the policy gradient computation. This encourages the generation of diverse reasoning strategies. To maintain optimization direction, the entropy bonus is carefully constrained by a clipping mechanism. Complemented by a strict, binary reward signal, EGPO effectively guides the model towards discovering structured and accurate tool invocation patterns. On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B-parameter model trained with EGPO sets a new state-of-the-art among models of comparable size, surpassing a range of strong competitors, including GPT-4o and Gemini-2.5.
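
To make the abstract's description concrete, the sketch below shows one way an entropy bonus could be folded into a group-relative advantage and clipped so that it never reverses the sign of the base advantage. This is only an illustration, not the paper's formula: the additive form, the names `alpha` and `clip_ratio`, the exact-match reward, and the assumption that a per-response CoT entropy is already available as a non-negative number are all hypothetical choices made for this example.

```python
import math
from typing import List


def binary_reward(predicted_call: dict, reference_call: dict) -> float:
    """Strict binary reward: 1.0 only on an exact match (assumed criterion)."""
    return 1.0 if predicted_call == reference_call else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def entropy_enhanced_advantages(
    rewards: List[float],
    cot_entropies: List[float],
    alpha: float = 0.1,       # hypothetical weight on the entropy bonus
    clip_ratio: float = 0.5,  # hypothetical cap, as a fraction of |advantage|
) -> List[float]:
    """Add a clipped CoT-entropy bonus to each group-relative advantage.

    Because the bonus is capped at clip_ratio * |advantage| with
    clip_ratio < 1, it can never flip the sign of the base advantage.
    Entropies are assumed to be non-negative.
    """
    base = group_relative_advantages(rewards)
    enhanced = []
    for adv, entropy in zip(base, cot_entropies):
        bonus = min(alpha * entropy, clip_ratio * abs(adv))
        enhanced.append(adv + bonus)
    return enhanced


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]          # binary rewards for four rollouts
    cot_entropies = [2.0, 2.0, 0.0, 100.0]  # made-up entropy values
    print(entropy_enhanced_advantages(rewards, cot_entropies))
```

In this sketch, a rollout with very high CoT entropy (the last one) still receives at most 1.5 times its base advantage, and an incorrect rollout keeps a negative advantage however diverse its reasoning is, which is one reading of "maintaining optimization direction".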

Paper Structure

This paper contains 26 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: By employing a more structured thinking and verification process, our EGPO framework correctly identifies all required parameters for the tool call, whereas the baseline GRPO model fails by swapping the arguments.
  • Figure 2: Overview of our EGPO framework. For a given query, EGPO calculates rewards using a single-criterion function and integrates CoT entropy with the advantage signal to guide the policy's exploration of reasoning paths.
  • Figure 3: Implementation of the data cleaning pipeline for reinforcement learning in function calling. We begin with LLM-based evaluation and correction, followed by Abstract Syntax Tree (AST) evaluation. Data is retained only after passing all stages or discarded after three regeneration attempts.
  • Figure 4: Performance on ACEBench and APIBank with all metrics calculated using the official scripts.
  • Figure 5: Visualization of the learning curves for EGPO and GRPO during training. We report the Average Reward, KL Divergence, Actor Entropy and Average Response Length.
  • ...and 6 more figures
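
The data cleaning loop described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `llm_review` and `regenerate` are hypothetical callables standing in for the LLM-based evaluation/correction and regeneration steps, and the AST stage is approximated by checking that the sample parses as a single Python-style function call.

```python
import ast
from typing import Callable, Optional


def passes_ast_check(call_text: str) -> bool:
    """Assumed AST stage: the text must parse as one function-call expression."""
    try:
        tree = ast.parse(call_text, mode="eval")
    except (SyntaxError, ValueError):
        return False
    return isinstance(tree.body, ast.Call)


def clean_sample(
    call_text: str,
    llm_review: Callable[[str], Optional[str]],  # hypothetical: corrected text, or None if rejected
    regenerate: Callable[[str], str],            # hypothetical: produces a new candidate
    max_regenerations: int = 3,
) -> Optional[str]:
    """Return a cleaned sample, or None if it is discarded.

    A sample is kept only when it passes both the LLM review and the AST
    check; otherwise it is regenerated, up to max_regenerations times.
    """
    candidate = call_text
    for attempt in range(max_regenerations + 1):
        corrected = llm_review(candidate)
        if corrected is not None and passes_ast_check(corrected):
            return corrected
        if attempt < max_regenerations:
            candidate = regenerate(candidate)
    return None


if __name__ == "__main__":
    # Toy stand-ins: the reviewer accepts everything, the regenerator appends ")".
    print(clean_sample("get_weather(city='Paris'", lambda t: t, lambda t: t + ")"))
```

With the toy stand-ins above, the malformed call fails the AST check once, is regenerated into a well-formed call, and is then retained; a sample that never passes is dropped after the third regeneration.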