MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin

TL;DR

MTR-Bench addresses the lack of benchmarks for multi-turn interactive reasoning in LLMs by introducing an automated evaluation framework built from Generator, Monitor, and Evaluator modules. It comprises 40 tasks across four reasoning categories (IP, DA, SO, SG), with difficulty calibrated at three levels, for a total of 3,600 evaluation instances. An empirical study across diverse models shows that frontier reasoning models still struggle with sustained multi-turn reasoning: accuracy often comes at the cost of interaction efficiency, and performance is further hurt by invalid operations. These findings underscore the importance of scalable, automated evaluation for guiding research toward robust, interactive AI systems, and the framework offers a blueprint for evolving the benchmark as model capabilities advance.

Abstract

Recent advances in Large Language Models (LLMs) have shown promising results on complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for evaluating LLMs' multi-turn reasoning. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities, offers fine-grained difficulty levels, and requires multi-turn interaction with the environment. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results yields valuable insights for future research in interactive AI systems.

Paper Structure

This paper contains 67 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: This figure represents the complete framework of our benchmark, from construction to evaluation. It includes four modules: data collection, data classification, dataset construction, and interactive evaluation. After the dataset is built, the evaluation system can perform automated multi-round interactive evaluations and automatically increase the difficulty of the problems.
  • Figure 2: This figure illustrates examples of our four task types. Each task includes interaction rules, query format requirements, and example interactions, with three levels of input difficulty.
  • Figure 3: Model accuracy vs. interaction turns across different tasks and difficulty levels.
  • Figure 4: Efficiency comparison of interaction turns between models on correctly-answered problems. For each pair (A vs B), A is labeled as Less if it requires fewer turns than B, and More otherwise. A higher proportion of Less indicates superior efficiency in problem-solving.
  • Figure 5: Invalid rate across evaluated models. A larger rate indicates weaker instruction-following and reasoning capabilities.
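
The pairwise comparison described in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `less_proportion`, the dictionary layout (problem id mapped to number of turns), and the choice to compare only problems that both models answered correctly are assumptions made for illustration, and the turn counts are made up. Following the caption's wording, a tie counts as More.

```python
from typing import Dict


def less_proportion(turns_a: Dict[str, int], turns_b: Dict[str, int]) -> float:
    """Share of problems on which model A is labeled Less relative to model B.

    Each dict maps a problem id to the number of interaction turns the model
    used on a problem it answered correctly (hypothetical layout).
    """
    # Assumption: only problems solved correctly by both models are compared.
    shared = turns_a.keys() & turns_b.keys()
    if not shared:
        raise ValueError("no problems were answered correctly by both models")
    # A is Less only when it uses strictly fewer turns; ties fall under More.
    less = sum(1 for pid in shared if turns_a[pid] < turns_b[pid])
    return less / len(shared)


# Hypothetical turn counts, for illustration only (not results from the paper).
model_a = {"p1": 3, "p2": 5, "p3": 4, "p4": 2}
model_b = {"p1": 4, "p2": 5, "p3": 2, "p5": 6}
share = less_proportion(model_a, model_b)
print(f"A is Less on {share:.0%} of shared problems")
```

Under these assumptions, a higher returned share corresponds to the caption's "higher proportion of Less", that is, model A tends to solve the same problems in fewer turns than model B.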