ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark

Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx

TL;DR

ToolComp evaluates multi-step tool-use reasoning by pairing final answers with per-step supervision. It provides 485 human-verified prompts and 1731 per-step labels and evaluates 16 models across six families; the majority of models score below 50% accuracy. Using synthetic training data, the authors compare process-supervised reward models (PRMs) with outcome-supervised reward models (ORMs) and find that PRMs rank trajectories substantially better, improving rank@1 accuracy by 19% on base-model trajectories and 11% on fine-tuned-model trajectories. By demonstrating stronger generalization and scalability for PRMs and providing a detailed benchmark and open-source plan, ToolComp advances evaluation and training for robust, multi-tool AI systems.

Abstract

Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures during inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation across six different model families demonstrates the challenging nature of our dataset, with the majority of models achieving less than 50% accuracy. Additionally, we generate synthetic training data to compare the performance of outcome-supervised reward models (ORMs) with process-supervised reward models (PRMs) to assess their ability to improve complex tool-use reasoning as evaluated by ToolComp. Our results show that PRMs generalize significantly better than ORMs, achieving a 19% and 11% improvement in rank@1 accuracy for ranking base and fine-tuned model trajectories, respectively. These findings highlight the critical role of process supervision in both the evaluation and training of AI models, paving the way for more robust and capable systems in complex, multi-step tool-use tasks.
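
To make the rank@1 comparison concrete, here is a minimal sketch and not the authors' implementation. It assumes each candidate trajectory carries a final-answer correctness flag, a single score from a hypothetical ORM, and a list of per-step scores from a hypothetical PRM. The `Trajectory` class, the function names, the aggregation options (min, mean, product), and the toy scores are all illustrative assumptions.

```python
from dataclasses import dataclass
from math import prod
from statistics import mean


@dataclass
class Trajectory:
    step_scores: list[float]  # hypothetical per-step PRM scores in [0, 1]
    outcome_score: float      # hypothetical single ORM score in [0, 1]
    is_correct: bool          # whether the final answer matches the reference


# Illustrative ways to collapse per-step scores into one trajectory score.
AGGREGATORS = {"min": min, "mean": mean, "prod": prod}


def select_with_prm(candidates, aggregate="min"):
    """Pick the candidate whose aggregated step scores are highest."""
    agg = AGGREGATORS[aggregate]
    return max(candidates, key=lambda t: agg(t.step_scores))


def select_with_orm(candidates):
    """Pick the candidate with the highest outcome score."""
    return max(candidates, key=lambda t: t.outcome_score)


def rank_at_1(prompts, selector):
    """Fraction of prompts whose top-ranked trajectory is correct."""
    hits = sum(selector(candidates).is_correct for candidates in prompts)
    return hits / len(prompts)


# Toy data: two prompts with two candidate trajectories each.
prompts = [
    [
        Trajectory([0.9, 0.2, 0.9], outcome_score=0.8, is_correct=False),
        Trajectory([0.7, 0.8, 0.7], outcome_score=0.6, is_correct=True),
    ],
    [
        Trajectory([0.9, 0.9], outcome_score=0.9, is_correct=True),
        Trajectory([0.4, 0.95], outcome_score=0.5, is_correct=False),
    ],
]

orm_rank_at_1 = rank_at_1(prompts, select_with_orm)
prm_rank_at_1 = rank_at_1(prompts, lambda c: select_with_prm(c, "min"))
```

In this toy data the first prompt contains a trajectory with one weak intermediate step but a high outcome score, so the step-aware selector avoids it while the outcome-only selector does not. The numbers are made up for illustration and say nothing about the paper's measured results.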

Paper Structure (90 sections, 18 figures, 12 tables)

Figures (18)

  • Figure 1: An example annotation path for collecting data that provides tool-call trajectories with human-verified final answers along with step-by-step process supervision labels. Each model-generated step (Action Plan and ReAct steps) is first labelled as correct or incorrect. For each component labelled incorrect, a rewrite is made to correct it. The annotations and rewrites are made by human annotators for the benchmark (or by a state-of-the-art LM for generating synthetic training data, as further described in Section \ref{sec:experiment_design}). A full annotated trajectory example is available in Appendix \ref{app:example_annotated_trajectory}.
  • Figure 2: Comparison of step-wise reasoning accuracy (x-axis) and final answer accuracy (y-axis) on ToolComp across 6 different model families.
  • Figure 3: A comparison of outcome-supervised and process-supervised reward models across various scales of training data (10%, 25%, 50%, 100%), evaluated by their ability to pick the best answer out of 30 tool-call trajectories. The 95% confidence intervals capture the variance across 500 random samples of 30 completions drawn from 50 completions. We plot performance on generations from both Llama-3.1-8b-Instruct (left) and Llama-3.1-8b-Instruct fine-tuned on all the preferred trajectories (right) (Dubey et al., 2024). The plot also shows the Pass@1 given by greedy sampling and the average Pass@30 accuracies for the respective generating models.
  • Figure 4: Distribution of the position of the maximum scoring step, normalized by the length of the trajectory, for the rank@1 selected trajectories.
  • Figure 5: Comparison of the performance of different aggregation methods used to combine step-level PRM scores. Results here use the PRM trained on all data with the Full Step with Observation supervision method.
  • ...and 13 more figures
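
The subsampling procedure in the Figure 3 caption and the normalized position in the Figure 4 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: the percentile-based interval, the `index / (length - 1)` normalization, and all function names are assumptions, since the captions do not specify how the interval or the normalization is computed.

```python
import random
from statistics import mean


def rank_at_1_subsampled(scores, correct, k, rng):
    """Draw k completions per prompt without replacement, select the
    top-scoring one, and return the fraction of prompts where it is correct.

    scores[p][i] is a reward-model score and correct[p][i] a correctness
    flag for completion i of prompt p.
    """
    hits = 0
    for prompt_scores, prompt_correct in zip(scores, correct):
        subset = rng.sample(range(len(prompt_scores)), k)
        best = max(subset, key=lambda i: prompt_scores[i])
        hits += prompt_correct[best]
    return hits / len(scores)


def subsample_interval(scores, correct, k=30, n_draws=500, seed=0):
    """Repeat the subsampled rank@1 estimate n_draws times and return
    (mean, lower, upper), where lower/upper are empirical 2.5% and 97.5%
    percentiles (an assumed way to form a 95% interval)."""
    rng = random.Random(seed)
    estimates = sorted(
        rank_at_1_subsampled(scores, correct, k, rng) for _ in range(n_draws)
    )
    lower = estimates[int(0.025 * n_draws)]
    upper = estimates[min(n_draws - 1, int(0.975 * n_draws))]
    return mean(estimates), lower, upper


def normalized_argmax_position(step_scores):
    """Position of the highest-scoring step, scaled to [0, 1].

    Assumed convention: 0.0 is the first step, 1.0 the last; a
    single-step trajectory maps to 0.0.
    """
    best = max(range(len(step_scores)), key=lambda i: step_scores[i])
    if len(step_scores) == 1:
        return 0.0
    return best / (len(step_scores) - 1)
```

With 50 scored completions per prompt, calling `subsample_interval(scores, correct)` mirrors the 500 draws of 30 described in the caption; fixing `seed` keeps the estimate reproducible.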