MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett

TL;DR

MuSR introduces a scalable, narrative-based benchmark for multistep soft reasoning in language models, built by combining neurosymbolic generation with natural-language stories. It covers three domains (Murder Mysteries, Object Placements, and Team Allocation), each probing a distinct reasoning faculty, from theory-of-mind to social-constraint optimization. The paper details a three-stage construction pipeline (Tree Template Construction, Reasoning Tree Completion, Story Generation), validates the resulting dataset, and benchmarks a range of prompting strategies and neurosymbolic methods, highlighting gaps between current models and human performance. The framework also emphasizes reproducibility, and its generation process can be scaled up as model capabilities grow, so the benchmark need not remain static. MuSR is thus a step toward realistic, linguistically rich reasoning benchmarks that stress both content understanding and intermediate reasoning structure.

Abstract

While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning; this makes it simultaneously much more challenging than other synthetically-crafted benchmarks while remaining realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning.

Paper Structure (62 sections, 2 figures, 9 tables, 1 algorithm)

Figures (2)

  • Figure 1: Dataset construction process for MuSR. First, we generate gold facts that are used to deduce the correct answer (the murderer in this case). Then, using an LLM, we create a reasoning tree leading to those deductions from facts in a story combined with commonsense. Finally, we iteratively generate a narrative one chunk at a time using the facts generated in step 2, validating the generations for fact consistency and recall.
  • Figure 2: Partial reasoning trees showing gold facts $F$, story facts $S(T)$, and commonsense facts $C(T)$ for each of our three domains. Dotted lines indicate incomplete trees. Each deduction sampled from an LLM will yield two scenario facts and one commonsense fact in our setup.
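
The two captions above can be made concrete with a small sketch. It is not the authors' implementation: the names `Node`, `expand`, `leaf_facts`, `write_story`, and the stub functions are hypothetical, the LLM calls are replaced by deterministic stand-ins, and the recall check is a crude substring test assumed here for illustration. It shows a gold fact being expanded so that each deduction yields two scenario facts and one commonsense fact, the leaves being collected as $S(T)$ and $C(T)$, and a narrative being written one chunk at a time with a recall check on each chunk.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    fact: str
    kind: str = "scenario"  # "scenario" or "commonsense"
    children: List["Node"] = field(default_factory=list)


# Hypothetical stand-in for an LLM call: given a fact to be deduced,
# return two scenario facts and one commonsense fact that support it.
Expander = Callable[[str], Tuple[str, str, str]]


def stub_expander(fact: str) -> Tuple[str, str, str]:
    return (f"[A -> {fact}]", f"[B -> {fact}]", f"[rule: A and B imply {fact}]")


def expand(node: Node, depth: int, expander: Expander) -> None:
    """Grow the tree below `node`; scenario facts at depth 0 stay as leaves."""
    if depth == 0:
        return
    a, b, rule = expander(node.fact)
    node.children = [Node(a), Node(b), Node(rule, kind="commonsense")]
    for child in node.children[:2]:  # commonsense facts are never expanded
        expand(child, depth - 1, expander)


def leaf_facts(root: Node) -> Tuple[List[str], List[str]]:
    """Collect story facts S(T) (scenario leaves) and commonsense facts C(T)."""
    story, commonsense = [], []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(reversed(node.children))
        elif node.kind == "commonsense":
            commonsense.append(node.fact)
        else:
            story.append(node.fact)
    return story, commonsense


def write_story(facts: List[str], chunk_size: int,
                generate: Callable[[List[str]], str], max_retries: int = 3) -> str:
    """Generate the narrative chunk by chunk, retrying a chunk that drops a fact."""
    chunks = []
    for i in range(0, len(facts), chunk_size):
        batch = facts[i:i + chunk_size]
        text = ""
        for _ in range(max_retries):
            text = generate(batch)
            if all(f in text for f in batch):  # crude recall check (assumption)
                break
        chunks.append(text)  # keeps the last attempt even if recall failed
    return "\n\n".join(chunks)


# Illustrative gold fact; the wording is made up for this example.
root = Node("the suspect had a motive")
expand(root, depth=2, expander=stub_expander)
story_facts, commonsense_facts = leaf_facts(root)
narrative = write_story(story_facts, chunk_size=2,
                        generate=lambda batch: " ".join(batch))
```

Only the scenario leaves are passed to the story writer in this sketch; the commonsense facts stay in the tree, on the assumption that a reader is expected to supply them.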