Reasoning Gets Harder for LLMs Inside A Dialogue

Ivan Kartáč, Mateusz Lango, Ondřej Dušek

Abstract

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.

Paper Structure (58 sections, 3 equations, 37 figures, 4 tables)

Figures (37)

  • Figure 1: An example from our BOULDER benchmark, showing the same problem instance in two variants: as an isolated task and within a task-oriented dialogue.
  • Figure 2: Evaluation results of 8 LLMs in the three main settings, averaged over all tasks. Asterisks indicate statistically significant differences between the Baseline and Dialogue, and Dialogue and Dialogue-concise settings (t-test, $*\colon p < 0.05$, $**\colon p < 0.01$, ${*}{*}{*}\colon p < 0.001$).
  • Figure 3: Results for the Dialogue with reduced domains and Dialogue without tools ablations, micro-averaged over all tasks. Significance indication follows Figure 2, and each ablation is compared to the main Dialogue setup.
  • Figure 4: Results for the Multi-turn baseline and Single-turn dialogue ablations, micro-averaged over all tasks. Significance indication follows Figure 2, and each ablation is compared with the corresponding main setup.
  • Figure 5: Results for the Baseline with dialogue role and Dialogue with reasoning instruction ablations, micro-averaged over all tasks. Significance indication follows Figure 2, and each ablation is compared with the corresponding main setup.
  • ...and 32 more figures