HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios

Jun Wang, Jiamu Zhou, Muning Wen, Xiaoyun Mo, Haoyu Zhang, Qiqiang Lin, Cheng Jin, Xihuai Wang, Weinan Zhang, Qiuying Peng, Jun Wang

TL;DR

HammerBench provides a fine-grained, real-world benchmark for multi-turn function calling in mobile assistant scenarios, built from authentic app functionalities and anonymized logs. It introduces a four-stage data-generation pipeline, diverse interaction trajectories, and a snapshot-based evaluation framework with metrics like $Acc$, $PHR$, $PMR$, $PR$, and $SR$. Empirical results across 10 LLMs reveal that handling argument shifts and external information remains the primary challenge, with the Snapshot-based approach outperforming traditional Learning-to-Ask paradigms in task success. The work offers practical insights for enhancing LLM robustness in real-world, multi-turn function-calling tasks and provides a data/tooling foundation for future research.

Abstract

Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark framework for assessing LLMs' function-calling capabilities in real-world, multi-turn dialogues. HammerBench simulates diverse mobile assistant use cases, incorporating imperfect instructions, dynamic question-answer trajectories, intent and argument shifts, and the indirect use of external information through pronouns. To construct this benchmark, we curate a comprehensive dataset derived from popular mobile app functionalities and anonymized user logs, complemented by a cost-effective data generation pipeline leveraging open-source models. HammerBench is further augmented with fine-grained interaction snapshots and metrics, enabling detailed evaluation of function-calling performance across individual conversational turns. We demonstrate the effectiveness of HammerBench by evaluating several leading LLMs and uncovering key performance trends. Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios, highlighting critical areas for further improvement in LLM robustness for mobile assistant applications.
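
The abstract describes scoring function calls at individual conversational turns and singles out parameter name errors as a major failure source. A minimal sketch of what such turn-level scoring could look like follows. It is an illustration under stated assumptions, not the paper's implementation: the `Call` type, the `score_snapshot` helper, and the exact rate definitions (hallucinated names over predicted names, missing names over expected names) are hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical container for one function call: a name plus keyword arguments."""
    name: str
    args: dict


def score_snapshot(pred: Call, gold: Call) -> dict:
    """Score one turn-level snapshot against its reference call.

    Assumed definitions (not taken from the paper):
      - hallucination_rate: share of predicted parameter names absent from the reference
      - missing_rate: share of reference parameter names absent from the prediction
    """
    pred_keys, gold_keys = set(pred.args), set(gold.args)
    hallucinated = pred_keys - gold_keys  # predicted but not expected
    missing = gold_keys - pred_keys       # expected but not predicted
    return {
        "name_correct": pred.name == gold.name,
        "hallucination_rate": len(hallucinated) / len(pred_keys) if pred_keys else 0.0,
        "missing_rate": len(missing) / len(gold_keys) if gold_keys else 0.0,
    }


def score_dialogue(preds: list, golds: list) -> list:
    """Score every snapshot of a multi-turn dialogue, one result per turn."""
    return [score_snapshot(p, g) for p, g in zip(preds, golds)]


if __name__ == "__main__":
    # Toy two-turn dialogue with made-up tool and parameter names.
    golds = [
        Call("set_alarm", {"time": "7:00"}),
        Call("set_alarm", {"time": "7:00", "repeat": "daily"}),
    ]
    preds = [
        Call("set_alarm", {"time": "7:00", "label": "wake"}),  # extra parameter name
        Call("set_alarm", {"time": "7:00"}),                   # omitted parameter name
    ]
    for turn, result in enumerate(score_dialogue(preds, golds), start=1):
        print(turn, result)
```

Scoring each snapshot separately, rather than only the final call, makes it possible to see at which turn a parameter name was invented or dropped, which is the kind of fine-grained view the abstract argues for.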

Paper Structure

This paper contains 37 sections, 1 equation, 3 figures, 13 tables.

Figures (3)

  • Figure 1: HammerBench construction pipeline: toolset collection, data generation, external knowledge generation, and validation. Blocks with GPT icons indicate the use of LLMs, while orange blocks represent verification modules, and green blocks denote various data types corresponding to each phase.
  • Figure 2: Examples of four types of test cases in HammerBench: 1) Diverse Q&A trajectories generated by merging user-agent interactions; 2) Intent shifts: agent terminates the session when users change their intent; 3) Argument shifts: three cases of changing slot values during interactions; 4) External individual information: users use pronouns instead of exact details, common in real-world interactions.
  • Figure 3: Statistics: a) the number of tools for each parameter count in our toolset; b) the number of conversations for each turn count in sQsA; c) the number of multi-turn test cases constructed for the Imperfect and External categories listed in the statistics table.