ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi

TL;DR

The paper tackles reward modeling for tool-using LLMs, addressing a gap in evaluating tool-calling reasoning. It introduces FC-RewardBench as the first benchmark for reward models in tool-calling and presents ToolRM, a suite of outcome reward models trained on synthetic data from open-weight LLMs. ToolRM variants (1.7B–14B) outperform larger baselines, delivering up to $25\%$ gains in Best-of-$n$ sampling, improved robustness to input noise, data-efficient fine-tuning via reward-guided filtering, and RL-training of policies without ground-truth rewards. These results demonstrate the importance of domain-specific reward signals for tool-use alignment and offer practical pathways for scalable, data-efficient RL in tool-enabled LLM systems. The work points to future directions such as chain-of-thought verifiers, environment-state integration, and a unified ORM/PRM framework to balance scalability with reasoning fidelity.

Abstract

As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has emerged as a critical yet underexplored area of research. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark to systematically evaluate reward models in tool-calling scenarios. Our analysis shows that current reward models frequently miss key signals of effective tool use, highlighting the need for domain-specific modeling. We address this by proposing a training framework for outcome reward models using data synthesized from permissively licensed, open-weight LLMs. We introduce ToolRM, a suite of reward models for tool use ranging from 1.7B to 14B parameters. Across diverse settings, these models consistently outperform general-purpose baselines. Notably, they achieve up to a 25% improvement with Best-of-N sampling, while also improving robustness to input noise, enabling effective data filtering, and supporting RL training of policy models.
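
To make the Best-of-N setting mentioned above concrete, the following is a minimal sketch of reward-guided selection: a policy produces N candidate tool calls, a reward model scores each one, and the highest-scoring candidate is returned. The names `best_of_n` and `toy_score` are hypothetical, and `toy_score` is only a hand-written stand-in for a learned reward model such as ToolRM; it is not the paper's implementation.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate tool call with the highest reward score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first candidate among ties.
    return max(candidates, key=score)


def toy_score(call: str) -> float:
    """Hypothetical stand-in for a reward model.

    Assumption for illustration only: a call is better the more of the
    required arguments it contains. A real reward model would score the
    call given the tool catalog and the conversation.
    """
    required = ("city=", "unit=")
    return sum(1.0 for arg in required if arg in call)


# Three sampled candidates for the same user request (illustrative data).
candidates = [
    'get_weather(city="Paris")',
    'get_weather(city="Paris", unit="celsius")',
    'get_weather()',
]

best = best_of_n(candidates, toy_score)
print(best)
```

In this sketch the scorer is a plain callable, so a trained reward model can be swapped in without changing the selection logic.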

Paper Structure

This paper contains 34 sections, 3 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Performance of ToolRM, top reward models from RewardBench, Tool-augmented RM (Themis), and leading LLMs-as-judges on FC-RewardBench. Note: Model names are abbreviated for conciseness (e.g., L3.1-xx, SR-xx, and SC-xx correspond to Llama-3.1-xx, SkyWorks-Reward-xx, and SkyWorks-Critics-xx, respectively). Full model names are provided in the appendix.
  • Figure 2: Performance of the Qwen3 series and xLAM-2 series in the Best-of-$n$ ($n=32$) setting across five benchmarks: API-Bank-1, API-Bank-2, NexusRaven, ToolAlpaca, and SealTools.
  • Figure 3: Representative example from FC-RewardBench in which the parameter player_count is set to an incorrect value. The tool catalog is hidden for brevity.
  • Figure 4: Data samples from ToolRM training data. Each sample has a tool catalog, a conversation between the user and assistant, and the corresponding correct and incorrect tool calls. In the top sample, the incorrect version is missing one tool call; in the bottom sample, it is missing a parameter from the tool call.
  • Figure 5: Correlation heatmap between FC-RewardBench performance and downstream accuracy across generator models and benchmarks, showing consistently strong alignment (avg. correlation = 0.84).
  • ...and 1 more figure