ToolRM: Outcome Reward Models for Tool-Calling Large Language Models
Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi
TL;DR
The paper tackles reward modeling for tool-using LLMs, addressing a gap in how tool-calling behavior is evaluated. It introduces FC-RewardBench, the first benchmark for reward models in tool-calling scenarios, and presents ToolRM, a suite of outcome reward models trained on synthetic data from open-weight LLMs. ToolRM variants (1.7B–14B) outperform larger baselines, delivering up to 25% gains with Best-of-N sampling, improved robustness to input noise, data-efficient fine-tuning via reward-guided filtering, and RL training of policies without ground-truth rewards. These results demonstrate the importance of domain-specific reward signals for tool-use alignment and offer practical pathways for scalable, data-efficient RL in tool-enabled LLM systems. The work points to future directions such as chain-of-thought verifiers, environment-state integration, and a unified ORM/PRM framework to balance scalability with reasoning fidelity.
Abstract
As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has emerged as a critical yet underexplored area of research. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark to systematically evaluate reward models in tool-calling scenarios. Our analysis shows that current reward models frequently miss key signals of effective tool use, highlighting the need for domain-specific modeling. We address this by proposing a training framework for outcome reward models using data synthesized from permissively licensed, open-weight LLMs. We introduce ToolRM, a suite of reward models for tool use ranging from 1.7B to 14B parameters. Across diverse settings, these models consistently outperform general-purpose baselines. Notably, they achieve up to a 25% improvement with Best-of-N sampling, while also improving robustness to input noise, enabling effective data filtering, and supporting RL training of policy models.
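For readers unfamiliar with Best-of-N sampling, the idea can be sketched in a few lines of Python: generate N candidate tool calls, score each with a reward model, and keep the highest-scoring one. This is a minimal illustration, not the authors' implementation. The function `best_of_n`, the scorer `toy_score`, and the example tool call `get_weather` are all hypothetical; in practice the scorer would be an outcome reward model such as ToolRM.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (the first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(call: str) -> float:
    """Hypothetical stand-in for a reward model, NOT ToolRM.

    It rewards calls that supply the (assumed) required "city" argument.
    """
    return 1.0 if '"city"' in call else 0.0


# Two candidate tool calls, as if sampled from a policy model (illustrative only).
candidates = [
    '{"name": "get_weather", "arguments": {}}',
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',
]

best = best_of_n(candidates, toy_score)
print(best)
```

The selection step needs no ground-truth label at inference time; its quality depends entirely on how well the scorer separates correct tool calls from incorrect ones.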
