ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx
TL;DR
ToolComp evaluates multi-step tool-use reasoning by pairing final answers with per-step supervision labels. It comprises 485 human-verified prompts and 1731 per-step labels. An evaluation of 16 models across six families shows that the benchmark is hard: the majority of models score below 50% accuracy. Using synthetic training data, the authors compare process-supervised reward models (PRMs) with outcome-supervised reward models (ORMs) and find that PRMs rank trajectories substantially better, improving rank@1 accuracy by 19% for base-model trajectories and 11% for fine-tuned-model trajectories. By demonstrating stronger generalization and scalability for PRMs, and by providing a detailed benchmark together with an open-source release plan, ToolComp advances both the evaluation and the training of robust multi-tool AI systems.
Abstract
Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures at inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation across six different model families demonstrates the challenging nature of our dataset, with the majority of models achieving less than 50% accuracy. Additionally, we generate synthetic training data to compare the performance of outcome-supervised reward models (ORMs) with process-supervised reward models (PRMs) and to assess their ability to improve complex tool-use reasoning as evaluated by ToolComp. Our results show that PRMs generalize significantly better than ORMs, achieving improvements of 19% and 11% in rank@1 accuracy when ranking base and fine-tuned model trajectories, respectively. These findings highlight the critical role of process supervision in both the evaluation and training of AI models, paving the way for more robust and capable systems in complex, multi-step tool-use tasks.
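
To make the rank@1 comparison concrete, the following is a minimal sketch of how such a metric could be computed when a reward model picks one trajectory out of several candidates per prompt. It is an illustration under stated assumptions, not the authors' implementation: the Trajectory structure, the function names, the use of the minimum step score to aggregate PRM scores, and all numeric values are hypothetical and chosen only to show the mechanics.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Trajectory:
    """One candidate solution to a prompt (hypothetical structure)."""
    step_scores: List[float]  # per-step scores, as a PRM would produce
    outcome_score: float      # one whole-trajectory score, as an ORM would produce
    is_correct: bool          # whether the final answer matches the reference


def orm_score(traj: Trajectory) -> float:
    """Outcome supervision: the trajectory is judged by a single score."""
    return traj.outcome_score


def prm_score(traj: Trajectory) -> float:
    """Process supervision: aggregate the per-step scores.

    Assumption: the minimum is used, so one weak step lowers the whole
    trajectory. Other aggregations (product, mean) are equally plausible.
    """
    return min(traj.step_scores)


def rank_at_1(
    prompts: Sequence[Sequence[Trajectory]],
    score: Callable[[Trajectory], float],
) -> float:
    """Fraction of prompts whose top-scored candidate is correct."""
    hits = 0
    for candidates in prompts:
        best = max(candidates, key=score)
        hits += int(best.is_correct)
    return hits / len(prompts)


# Toy data (made up): two prompts with two candidate trajectories each.
prompts = [
    [
        Trajectory([0.9, 0.2, 0.8], outcome_score=0.9, is_correct=False),
        Trajectory([0.7, 0.8, 0.7], outcome_score=0.6, is_correct=True),
    ],
    [
        Trajectory([0.9, 0.9], outcome_score=0.8, is_correct=True),
        Trajectory([0.4, 0.9], outcome_score=0.5, is_correct=False),
    ],
]

print("ORM rank@1:", rank_at_1(prompts, orm_score))
print("PRM rank@1:", rank_at_1(prompts, prm_score))
```

In this toy data the first prompt contains an incorrect trajectory with a high outcome score but one weak intermediate step. The per-step aggregation penalizes that step while the outcome score does not, which is the kind of difference a rank@1 comparison between PRMs and ORMs is meant to expose.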
