Are Large Reasoning Models Interruptible?

Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

TL;DR

The paper probes whether Large Reasoning Models remain reliable when reasoning is interrupted or when context changes mid-inference, challenging the frozen-world assumption common in static evaluations. It introduces a formal dynamic evaluation framework with time-constraint and update-driven interruptions and tests state-of-the-art LRMs on math and coding tasks, revealing robustness gaps and novel pathologies such as reasoning leakage, panic, and self-doubt. The findings show performance can drop by up to 60% when late updates are introduced, and that careful prompt guidance and interaction design can mitigate some failures, though not universally. The work highlights the need for architecture and prompting strategies tailored to interruptible, persistent reasoning in dynamic environments, with implications for real-world AI assistants and programming agents.

Abstract

Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information. Project Page: http://dynamic-lm.github.io/

Paper Structure

This paper contains 29 sections, 16 equations, 13 figures.

Figures (13)

  • Figure 1: How do LRMs perform in dynamic worlds? (a) Unlike static ‘frozen world’ settings that assume users wait for completion, real-world scenarios often demand mid-inference updates, as LRM reasoning can be time-consuming. We introduce a public evaluation suite to assess how LRMs handle interruptions across math and coding tasks (\ref{sec:benchmarks}). We define two types of interruptions: time-constrained (hard: answer immediately; soft: speed up reasoning) and update-driven (task specifications change mid-reasoning). (b) We find that LRMs have three common failure modes: reasoning leakage can produce up to 10x longer answers after hard interrupts, "moving" reasoning tokens into the answer segment; over 90% of new errors under speedup arise from panic, when the models prematurely terminate their reasoning process; and roughly 80% of update-driven interrupt errors stem from self-doubt, where models fail to validate and incorporate new information. Results are reported at 30% interruption points; detailed results are provided in \ref{sec:result}.
  • Figure 2: Efficiency and Accuracy under Hard Interrupts. The top row reports model performance (Pass@1, denoted as $A(X)$ in Section 3), while the bottom row shows absolute final answer lengths $L(X)$ under two settings across different interrupt positions $X$. In the top row, we observe that LRMs behave almost like anytime models, with performance improving as more reasoning budget is provided. In the bottom row, we find evidence of reasoning leakage: when interrupted too early, models often continue reasoning in their final answers despite being forcibly terminated.
  • Figure 3: Answer Length Analysis under Soft Interrupts (Speedup). When receiving an instruction to speed up the reasoning process (i.e., soft interrupt), models generally comply, as the updated output length $L^*(X)$ is shorter than the original $L(X)$, with the exception of later interrupt positions (e.g., Magistral-S-1.2 on GSM8K at 0.9). However, soft interrupts can hurt performance on harder tasks such as AIME and LiveCodeBench, where GPT-OSS and Qwen sometimes exhibit answer panic, immediately producing incorrect outputs (see \ref{fig:problem_setting} and \ref{fig:soft_interrupt_ratio}).
  • Figure 4: Accuracy under Update-Driven Interrupts. When provided with updates mid-reasoning, models often suffer substantial performance drops. Adding prompt guidance fully resolves the issue on GSM8K and MATH500 and reduces the gap between full thinking and interrupt settings on AIME and LiveCodeBench datasets.
  • Figure 5: We observe that one of the behaviors causing the significant performance drops under update-driven interrupts is self-doubt (example in red), where models second-guess prior reasoning and produce incorrect answers even without output length constraints. Adding prompt guidance, a short targeted postfix written in the model’s own voice, can partially mitigate this issue (example in green), improving accuracy on math tasks and fully resolving self-doubt on GSM8K and MATH500. Additional qualitative examples are provided in \ref{app:self_doubt}.
  • ...and 8 more figures
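
The hard-interrupt measurement described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' evaluation suite: `Sample`, `truncate_reasoning`, `evaluate_hard_interrupt`, and the `answer_fn` callback are hypothetical names introduced here, and the sketch assumes that an interrupt at position $X$ keeps the first fraction $X$ of an uninterrupted reasoning trace and then forces the model to answer. It returns the accuracy $A(X)$ and the mean final-answer length $L(X)$.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Sample:
    """One benchmark item (hypothetical container, not from the paper)."""
    question: str
    reasoning_tokens: List[str]  # reasoning trace from an uninterrupted run
    reference: str               # ground-truth answer


def truncate_reasoning(tokens: Sequence[str], x: float) -> List[str]:
    """Keep the first fraction x of the reasoning trace (0 <= x <= 1)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be in [0, 1]")
    return list(tokens[: int(len(tokens) * x)])


def evaluate_hard_interrupt(
    samples: Sequence[Sample],
    x: float,
    answer_fn: Callable[[str, List[str]], Tuple[str, int]],
) -> Tuple[float, float]:
    """Return (A(X), L(X)) for interrupt position x.

    answer_fn is an assumed stand-in for the model call: given the question
    and the truncated reasoning, it returns the forced final answer and the
    number of tokens in that answer.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    correct = 0
    total_answer_len = 0
    for s in samples:
        partial = truncate_reasoning(s.reasoning_tokens, x)
        prediction, answer_len = answer_fn(s.question, partial)
        correct += int(prediction == s.reference)
        total_answer_len += answer_len
    n = len(samples)
    return correct / n, total_answer_len / n
```

Sweeping `x` over a grid (for example 0.1 to 0.9) and plotting both returned values gives curves of the kind the caption describes; under these assumptions, an answer length that grows as `x` shrinks would be the signature of reasoning leakage.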