TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

Yiran Zhang, Mo Wang, Xiaoyang Li, Kaixuan Ren, Chencheng Zhu, Usman Naseem

TL;DR

TurnBench-MS introduces a dynamic, interactive benchmark to evaluate multi-turn, multi-step reasoning in large language models by framing it as a Turing Machine-inspired code-breaking game. It provides ground-truth intermediate reasoning, two modes (Classic and Nightmare), and a robust automated pipeline to assess both final decisions and reasoning processes. Across 540 instances, state-of-the-art models lag behind humans, with performance dramatically dropping in Nightmare mode, highlighting substantial gaps in current reasoning capabilities. The work also analyzes error dynamics, scalability effects, and contamination resistance, offering a rigorous framework for diagnosing and advancing multi-turn reasoning in LLMs with potential practical impact for real-world AI systems.

Abstract

Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by the "Turing Machine Board Game." In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps, capabilities that remain underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine-grained analysis, we provide ground-truth annotations for intermediate reasoning steps. Our evaluation of state-of-the-art LLMs reveals significant gaps: the best model achieves 84% accuracy in Classic mode, but performance drops to 18% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi-step, multi-turn reasoning in LLMs.

Paper Structure

This paper contains 102 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Accuracy versus average turns for leading LLMs and human evaluators (Best, Average) on TurnBench in both "Classic" and "Nightmare" modes. Insets show the relative accuracy drop of LLMs compared to the Best Human. Results highlight that LLMs remain substantially less accurate than humans, especially under the "Nightmare" setting, underscoring current limitations in complex multi-turn reasoning.
  • Figure 2: Overview of the TurnBench game framework. The LLM's objective is to deduce a secret 3-digit code composed of digits from 1 to 5. The game proceeds in iterative rounds, each comprising: 1) Proposal Step: The LLM submits a candidate 3-digit code. 2) Question Step: The LLM queries up to three verifiers, each providing Pass/Fail feedback based on its unique Hidden Active Criterion (HAC). 3) Deduce Step: The LLM analyzes the collective feedback to either Submit the final code if confident in its correctness, or 4) Continue (end of the round) to the next round with a revised proposal. This iterative process continues until the LLM successfully deduces and submits the correct code.
  • Figure 3: Example of verification process. This verifier (Right) compares the values assigned to yellow and purple. There are three possible criteria: less than, equal to, or greater than. The Hidden Active Criterion (HAC) (Red) represents the specific constraint activated by a verifier in a given game setup. When a tested code satisfies this criterion, the verifier returns "PASS"; otherwise, it returns "FAIL" (Left).
  • Figure 4: The Reasoning Process Evaluation Pipeline in TurnBench. This pipeline analyzes the LLM's Chain-of-Thought (CoT) generated as it deduces verifier properties during the game (blue). The evaluation proceeds in three steps: 1) Inference Extraction (red): The LLM's CoT, detailing its reasoning for each verifier's Hidden Active Criterion (HAC), is processed by the Inference Extractor. This yields "Extracted Conclusions" – the LLM's inferred HAC for each verifier. 2) Ground Truth Collection (orange): Simultaneously, the "Current Game Setup ID" is used to retrieve the definitive "Ground Truth HAC" for each verifier from the "Game Metadata". 3) Judge (green): The Judger then semantically compares the "Extracted Conclusions" from Step 1 with the corresponding "Ground Truth HAC" from Step 2. Each inferred HAC is categorized as: Correct (semantically equivalent to the ground truth), Incorrect (completely wrong), or Include (the conclusion contains the correct answer but is not yet fully refined to the precise ground truth).
  • Figure 5: Probability of a model remaining incorrect in each subsequent round after its initial error, conditioned on it being incorrect in the previous round. The likelihood of continuing in an incorrect state increases with each turn, approaching near certainty beyond the fifth round. This trend highlights the models’ limited capacity for self-correction once they enter an error state.
  • ...and 2 more figures
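
The verifier mechanics described in the captions for Figures 2 and 3 can be illustrated with a small sketch. This is not the authors' implementation: the names `CRITERIA`, `verify`, and `consistent_criteria` are hypothetical, and mapping "yellow" to the second digit and "purple" to the third is an assumption made only for illustration. The sketch shows how Pass/Fail feedback from one verifier narrows down which of its three candidate criteria could be the Hidden Active Criterion (HAC).

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Code = Tuple[int, int, int]  # a 3-digit code, each digit in 1..5

# Hypothetical verifier modeled on the Figure 3 caption: it compares two
# digits and has three candidate criteria. Assumption: "yellow" is the
# second digit and "purple" is the third.
CRITERIA: Dict[str, Callable[[Code], bool]] = {
    "less": lambda c: c[1] < c[2],
    "equal": lambda c: c[1] == c[2],
    "greater": lambda c: c[1] > c[2],
}

# Every code the player could propose: 5 ** 3 = 125 possibilities.
ALL_CODES: List[Code] = list(product(range(1, 6), repeat=3))


def verify(code: Code, active: str) -> str:
    """Return the verifier's feedback for `code` under the active criterion."""
    return "PASS" if CRITERIA[active](code) else "FAIL"


def consistent_criteria(history: List[Tuple[Code, str]]) -> List[str]:
    """Return the criteria that would have produced all observed feedback."""
    return [
        name
        for name in CRITERIA
        if all(verify(code, name) == feedback for code, feedback in history)
    ]


if __name__ == "__main__":
    hidden = "greater"  # the HAC, unknown to the player
    history: List[Tuple[Code, str]] = []
    for guess in [(1, 2, 4), (1, 4, 2)]:
        history.append((guess, verify(guess, hidden)))
        print(guess, history[-1][1], consistent_criteria(history))
```

In this sketch a single FAIL leaves two criteria open, while a later PASS pins down the HAC. That is the kind of cross-round integration of clues the benchmark is designed to test.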