ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

Ippei Fujisawa, Sensho Nobe, Hiroki Seto, Rina Onda, Yoshiaki Uchida, Hiroki Ikoma, Pei-Chun Chien, Ryota Kanai

TL;DR

We design a reasoning task that isolates multi-step inference by largely eliminating path exploration and implicit knowledge utilization, enabling a thorough assessment of state-of-the-art LLMs' ability to follow instructions.

Abstract

Reasoning is central to a wide range of intellectual activities, and while the capabilities of large language models (LLMs) continue to advance, their performance in reasoning tasks remains limited. The processes and mechanisms underlying reasoning are not yet fully understood, but key elements include path exploration, selection of relevant knowledge, and multi-step inference. Problems are solved through the synthesis of these components. In this paper, we propose a benchmark that focuses on a specific aspect of reasoning ability: the direct evaluation of multi-step inference. To this end, we design a reasoning task that isolates multi-step inference by largely eliminating path exploration and implicit knowledge utilization. Our dataset comprises pairs of explicit instructions and corresponding questions, where the procedures necessary for solving the questions are entirely detailed within the instructions. This setup allows models to solve problems solely by following the provided directives. By constructing problems that require varying numbers of steps to solve and evaluating responses at each step, we enable a thorough assessment of state-of-the-art LLMs' ability to follow instructions. To ensure the robustness of our evaluation, we include multiple distinct tasks. Furthermore, by comparing accuracy across tasks, utilizing step-aware metrics, and applying separately defined measures of complexity, we conduct experiments that offer insights into the capabilities and limitations of LLMs in reasoning tasks. Our findings have significant implications for the development of LLMs and highlight areas for future research in advancing their reasoning abilities. Our dataset is available at https://huggingface.co/datasets/ifujisawa/procbench and our code at https://github.com/ifujisawa/proc-bench.

Paper Structure (17 sections, 4 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: An example from the task DeleteChar. (a) shows the input prompt, where the task is to iteratively remove specific letters from the given string according to the provided steps. (b) represents the ground truth label, which demonstrates the intermediate and final states of the string after performing each step of deletion.
  • Figure 2: Performance Metrics: SM, PA, FM, and PML across models and problem length.
  • Figure 3: Proportion of PA across problem lengths for o1-preview.
  • Figure 4: Proportion of Correct Predictions at or above step threshold.
  • Figure 5: Prefix Match Length (PML) for different problem lengths across all models and three tasks: FindCyclic, Compare, and Sort. Each bar in the graph represents the average PML for a given problem length, with separate graphs for each model-task pair.
  • ...and 8 more figures
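
The Figure 1 caption describes a procedure that can be followed step by step, with a ground truth that records every intermediate state. The sketch below illustrates that idea and a simple step-aware comparison in the spirit of Prefix Match Length. It is not the authors' implementation: the function names are hypothetical, and it assumes that each step deletes every occurrence of one letter and that a prefix match counts the leading steps on which the prediction and the label agree. The actual task rules and metric definitions may differ.

```python
from typing import List


def delete_char_states(text: str, letters: List[str]) -> List[str]:
    """Return the string after each deletion step (intermediate and final states)."""
    states = []
    current = text
    for letter in letters:
        # Assumption: one step removes all occurrences of the given letter.
        current = current.replace(letter, "")
        states.append(current)
    return states


def prefix_match_length(predicted: List[str], label: List[str]) -> int:
    """Count the leading steps on which the prediction agrees with the label."""
    count = 0
    for pred_state, true_state in zip(predicted, label):
        if pred_state != true_state:
            break  # stop at the first step that diverges
        count += 1
    return count


if __name__ == "__main__":
    label = delete_char_states("banana", ["a", "n"])
    print(label)  # ['bnn', 'b']
    # A hypothetical model answer that gets the first step right and the second wrong.
    print(prefix_match_length(["bnn", "bn"], label))  # 1
```

Scoring every intermediate state, rather than only the final string, is what allows a comparison like this to show at which step a response first departs from the procedure.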