Hammer: Robust Function-Calling for On-Device Language Models via Function Masking

Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Jun Wang, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, Jun Wang, Weinan Zhang

TL;DR

Function-calling models perform inconsistently across benchmarks, largely because they are misled by function and parameter naming conventions. The authors introduce Hammer, a family of lightweight on-device models trained with function masking and an irrelevance-augmented dataset, which improves generalization and reduces reliance on function and parameter names. They release a tuning framework, an irrelevance-detection dataset, and the Hammer models, reporting state-of-the-art results on BFCL v2 and strong cross-benchmark performance. The work offers practical benefits for secure edge deployments of autonomous agents with tool capabilities.

Abstract

Large language models have demonstrated impressive value as autonomous agents when equipped with external tools and API calls. Nonetheless, harnessing their potential for complex tasks depends crucially on improving their function-calling capabilities. This paper identifies a critical gap in existing function-calling models: performance varies significantly across benchmarks, often because the models are misled by specific naming conventions. To address this issue, we introduce Hammer, a novel family of foundation models specifically engineered for on-device function calling. Hammer employs an augmented dataset that enhances models' sensitivity to irrelevant functions and incorporates function masking techniques to reduce the influence of misleading names. Our empirical evaluations reveal that Hammer not only outperforms larger models but also generalizes robustly across diverse benchmarks, achieving state-of-the-art (SOTA) results. Our open-source contributions include a specialized dataset for irrelevance detection, a tuning framework for enhanced generalization, and the Hammer models, establishing a new standard for function-calling performance.

Paper Structure (19 sections, 7 figures, 7 tables)

This paper contains 19 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Demonstration of a simple function-calling process.
  • Figure 2: Case studies examining the performance degradation when function names and parameter names are obfuscated during test time.
  • Figure 3: Step-by-step building workflow of Hammer series with function masking.
  • Figure 4: Demonstration of different function-calling tasks.
  • Figure 5: An ablation to evaluate the impact of different masking ratios. For instance, "mask 0.33" denotes that 33% of the instances in the training batch are masked, while others remain unaltered.
  • ...and 2 more figures
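
The masking ratio in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that masking means replacing function and parameter names with random strings while leaving descriptions intact, and that a ratio of 0.33 means about a third of the instances in a batch are selected for masking. The instance layout (`tools`, `label`) and the helpers `random_name`, `mask_instance`, and `mask_batch` are hypothetical names introduced only for this example.

```python
import copy
import random
import string


def random_name(rng, length=8):
    # Hypothetical helper: a random lowercase identifier that carries no meaning.
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def mask_instance(instance, rng):
    """Return a copy whose function and parameter names are random strings.

    Descriptions are left untouched, and the target call (the label) is
    renamed consistently so that it still refers to the same function.
    """
    masked = copy.deepcopy(instance)
    orig_label = instance["label"]
    for tool in masked["tools"]:
        old_name = tool["name"]
        new_name = random_name(rng)
        renames = {param: random_name(rng) for param in tool["parameters"]}
        tool["name"] = new_name
        tool["parameters"] = {
            renames[param]: spec for param, spec in tool["parameters"].items()
        }
        if orig_label is not None and orig_label["name"] == old_name:
            masked["label"] = {
                "name": new_name,
                "arguments": {
                    renames[key]: value
                    for key, value in orig_label["arguments"].items()
                },
            }
    return masked


def mask_batch(batch, ratio, rng):
    # Assumption: "mask 0.33" selects round(ratio * batch size) instances;
    # the remaining instances are returned unaltered.
    k = round(ratio * len(batch))
    chosen = set(rng.sample(range(len(batch)), k))
    return [
        mask_instance(item, rng) if i in chosen else item
        for i, item in enumerate(batch)
    ]


# Illustrative instance; the schema is an assumption, not taken from the paper.
example = {
    "tools": [
        {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "city": {"type": "string", "description": "Name of the city."}
            },
        }
    ],
    "label": {"name": "get_weather", "arguments": {"city": "Paris"}},
}
```

Calling `mask_batch([example] * 3, 0.33, random.Random(0))` would mask one of the three instances. In a masked instance the model can no longer match on `get_weather` or `city` and has to rely on the descriptions to choose the function and fill in its arguments.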