FunReason: Enhancing Large Language Models' Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement

Bingguang Hao, ZengZhuang Xu, Maolin Wang, Yuntao Wen, Yicheng Chen, Cunyin Peng, Long Chen, Dong Wang, Xiangyu Zhao, Jinjie Gu, Chenyi Zhuang, Ji Zhang

TL;DR

This work targets robust tool use in large language models by addressing two overlooked issues in supervised fine-tuning: imbalanced training signals between lengthy reasoning and concise function calls, and a scarcity of hard, edge-case data. It introduces BalanceSFT, combining a Self-adjusted Signal Balancing loss and a Hard Data Re-sampling data refinement loop to dynamically reweight learning between CoT reasoning and function execution while iteratively producing high-quality hard examples guided by model errors. Empirical results on BFCL, ACEBench, and APIBank show that a 7B model trained with BalanceSFT achieves tool-calling performance competitive with GPT-4o on BFCL and strong generalization on tool-use benchmarks, all while mitigating catastrophic forgetting in coding tasks. The approach offers a scalable, non-RL pathway to robust, generalizable LLM tool use through balanced training and automated data refinement, with limitations including reliance on the initial CoT data quality and the computational cost of ensemble judgments during HDR.

Abstract

The integration of large language models (LLMs) with function calling has emerged as a crucial capability for enhancing their practical utility in real-world applications. However, effectively combining reasoning processes with accurate function execution remains a significant challenge. Traditional training approaches often struggle to balance the detailed reasoning steps with the precision of function calls, leading to suboptimal performance. To address these limitations, we introduce FunReason, a novel framework that enhances LLMs' function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss (SRML) approach. FunReason leverages LLMs' natural reasoning abilities to generate high-quality training examples, focusing on query parseability, reasoning coherence, and function call precision. The SRML approach dynamically balances the contribution of reasoning processes and function call accuracy during training, addressing the inherent trade-off between these two critical aspects. FunReason achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning. FunReason provides a comprehensive solution for enhancing LLMs' function calling capabilities by introducing a balanced training methodology and a data refinement pipeline. The code and dataset are available in our GitHub repository: https://github.com/BingguangHao/FunReason
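The imbalance the abstract describes can be made concrete with a small numeric sketch. It is not the paper's SRML formula, which this page does not give: the function names, the boolean mask marking function-call tokens, and the fixed `call_weight` are all assumptions made for illustration. The sketch only shows why a plain per-token average lets a long reasoning span dominate a short function call, and how averaging each segment separately changes that.

```python
from typing import Sequence


def token_mean_loss(token_losses: Sequence[float]) -> float:
    """Plain average of per-token losses, as in standard SFT."""
    return sum(token_losses) / len(token_losses)


def segment_balanced_loss(
    token_losses: Sequence[float],
    is_call_token: Sequence[bool],
    call_weight: float = 0.5,  # hypothetical fixed weight, not from the paper
) -> float:
    """Average reasoning tokens and function-call tokens separately,
    then mix the two segment means with a fixed weight."""
    if len(token_losses) != len(is_call_token):
        raise ValueError("token_losses and is_call_token must have equal length")

    reasoning = [l for l, c in zip(token_losses, is_call_token) if not c]
    call = [l for l, c in zip(token_losses, is_call_token) if c]

    # Fall back to the plain mean when one segment is missing.
    if not reasoning or not call:
        return token_mean_loss(token_losses)

    reasoning_mean = sum(reasoning) / len(reasoning)
    call_mean = sum(call) / len(call)
    return (1.0 - call_weight) * reasoning_mean + call_weight * call_mean


if __name__ == "__main__":
    # 18 well-fit reasoning tokens and 2 poorly fit function-call tokens.
    losses = [0.5] * 18 + [2.0] * 2
    mask = [False] * 18 + [True] * 2
    print(token_mean_loss(losses))              # dominated by reasoning tokens
    print(segment_balanced_loss(losses, mask))  # call tokens count equally
```

With these toy numbers the plain mean is 0.65 while the segment-balanced value is 1.25, so the error on the two function-call tokens is no longer diluted by the eighteen reasoning tokens. A dynamic scheme, as the abstract describes, would adjust the weight during training instead of fixing it.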

Paper Structure

This paper contains 23 sections, 10 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Two challenges in LLM Function Calling. (a) Imbalanced Training Signals: The lengthy Chain-of-Thought (CoT) tokens dominate the learning signal, overshadowing the concise but critical function call. (b) Imbalanced Data Hardness: The training dataset is dominated by simple examples, with a scarcity of hard cases necessary for robust performance.
  • Figure 2: Overview of the BalanceSFT framework. It starts with a standard function call dataset, which is refined through a Base Quality Check and Answer Check to create initial training data and identify hard data. The model is first initialized via a Cold Start using the Self-adjusted Signal Balancing (SSB) Loss. Subsequently, the Hard Data Re-sampling (HDR) strategy creates a Self-evolving Loop where the model iteratively reasons on hard cases, generates new solutions, and undergoes quality-gated retraining.
  • Figure 3: Performance comparison on ACEBench and APIBank benchmarks using official evaluation scripts, reported as accuracy (%).
  • Figure 4: More experiments. (a) Performance of two series models trained by GRPO and BalanceSFT on BFCL. (b) Performance of BalanceSFT and SFT models on HumanEval and MBPP (including HumanEval+ and MBPP+) compared with that of the code pre-trained model (Qwen2.5-Coder-7B-Inst). (c) Multi-Turn performance of BalanceSFT-7B at different stages in Self-evolving.
  • Figure 5: API categories distribution of xlam-function-calling-60k and Open-Agentic-tool-use. Different colors represent the distribution of different API categories.
  • ...and 7 more figures