interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors

Vishak K Bhat, Prateek Chanda, Ashmit Khandelwal, Maitreyi Swaroop, Vineeth N. Balasubramanian, Subbarao Kambhampati, Nagarajan Natarajan, Amit Sharma

TL;DR

interwhen introduces a generalizable, verifier-guided framework for verifiable reasoning at test time. It combines meta-prompting, which exposes verifiable intermediate states, with a Sequential Verifier that steers a single reasoning trace. It supports both self-verification and external verifiers, and provides a software abstraction (extract_state, verify, intervene) for plugging in diverse verifiers. Empirical results show substantial token-efficiency gains with self-verification (e.g., up to ~32% reduction) and accuracy improvements with external verification (up to tens of percentage points) across spatial reasoning, math, and VERINA code/spec tasks, while preserving soundness. The framework enables flexible, scalable test-time verification across domains without constraining the model to fixed step decompositions, with practical impact for high-stakes applications and efficient reasoning.

Abstract

We present a test-time verification framework, interwhen, that ensures that the output of a reasoning model is valid with respect to a given set of verifiers. Verified reasoning is an important goal in high-stakes scenarios such as deploying agents in the physical world or in domains such as law and finance. However, current techniques either rely on the generate-test paradigm, which verifies only after the final answer is produced, or verify partial output through a step-extraction paradigm, where the task execution is externally broken down into structured steps. The former is inefficient, while the latter artificially restricts a model's problem-solving strategies. Instead, we propose to verify a model's reasoning trace as-is, taking full advantage of the model's reasoning capabilities while verifying and steering its output only when needed. The key idea is meta-prompting: identifying the verifiable properties that any partial solution should satisfy, and then prompting the model to follow a custom format in its trace so that partial outputs can be easily parsed and checked. We consider both self-verification and external verification and find that interwhen provides a useful abstraction for giving feedback to and steering reasoning models in each case. Using self-verification, interwhen obtains state-of-the-art results on early stopping for reasoning models, without any loss in accuracy. Using external verifiers, interwhen obtains a 10 p.p. improvement in accuracy over test-time scaling methods, while ensuring 100% soundness and being 4x more efficient. The code for interwhen is available at https://github.com/microsoft/interwhen
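
The external-verification loop described in the abstract can be illustrated with a small sketch. This is not the interwhen API: only the three method names (extract_state, verify, intervene) come from the TL;DR above. The class `RunningSumMonitor`, the `Verdict` type, the helper `monitor_step`, and the `STATE: a + b = c` trace format are all hypothetical, chosen to keep the example self-contained. The assumed meta-prompt asks the model to write intermediate additions in that format so they can be parsed and checked by plain arithmetic.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Verdict:
    """Result of one verifier call (hypothetical type, for illustration)."""
    ok: bool
    feedback: str = ""


class RunningSumMonitor:
    """Toy monitor. Assumes the meta-prompt asked the model to emit lines
    like 'STATE: a + b = c'; this format is an assumption, not interwhen's."""

    PATTERN = re.compile(r"STATE:\s*(-?\d+)\s*\+\s*(-?\d+)\s*=\s*(-?\d+)")

    def extract_state(self, trace: str) -> Optional[Tuple[int, int, int]]:
        # Parse the most recent intermediate state from the partial trace.
        matches = self.PATTERN.findall(trace)
        if not matches:
            return None
        a, b, c = matches[-1]
        return int(a), int(b), int(c)

    def verify(self, state: Tuple[int, int, int]) -> Verdict:
        # External verifier: recompute the sum and compare.
        a, b, c = state
        if a + b == c:
            return Verdict(ok=True)
        return Verdict(ok=False, feedback=f"{a} + {b} is {a + b}, not {c}.")

    def intervene(self, trace: str, verdict: Verdict) -> str:
        # Append verifier feedback to the trace so generation can resume from it.
        return (
            trace
            + f"\nWait, the verifier reports an error: {verdict.feedback}"
            + " Let me redo this step.\n"
        )


def monitor_step(monitor: RunningSumMonitor, trace: str) -> str:
    """Check the latest state; leave the trace untouched unless it fails."""
    state = monitor.extract_state(trace)
    if state is None:
        return trace
    verdict = monitor.verify(state)
    return trace if verdict.ok else monitor.intervene(trace, verdict)
```

In this sketch the trace is modified only when a check fails, which mirrors the abstract's point about steering the model's output only when needed. A real deployment would call `monitor_step` on the partial output during generation and continue decoding from the returned text.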

Paper Structure (34 sections, 3 figures, 8 tables)

Figures (3)

  • Figure 1: LLM-Process Modulo: Intervening on partial output traces to guide a model towards a more accurate and/or more efficient solution. Meta-prompting instructs the model to output intermediate states that can be extracted and verified. In the first case, we track mentions of candidate solutions within the <think> block and insert </think> when the same answer appears consecutively. In the second case, we run an external verifier on each intermediate state and append verifier feedback to the model's output trace in case of any error. Compared to existing methods such as tree-of-thoughts, we do not artificially break up the problem into step-wise generation, and as a result, obtain both higher accuracy and efficiency.
  • Figure 2: Accuracy versus normalized reasoning-token usage for Qwen3-30B-A3B-Thinking-2507. The dotted line marks the baseline accuracy at full token usage.
  • Figure 3: Accuracy versus normalized reasoning-token usage for Qwen/QwQ-32B. The dotted line marks the baseline accuracy at full token usage.
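
The self-verification case in the Figure 1 caption (track candidate answers inside the <think> block and insert </think> once the same answer appears consecutively) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ANSWER: x` marker, the function name `early_stop`, and the `repeats` parameter are hypothetical, and the stream of text chunks stands in for a model's incremental output.

```python
import re
from typing import Iterable

# Hypothetical marker the meta-prompt asks the model to use for candidate answers.
# The lookahead requires trailing whitespace so that an answer split across
# two streamed chunks is not read before it is complete.
CANDIDATE = re.compile(r"ANSWER:\s*(\S+)(?=\s)")


def early_stop(chunks: Iterable[str], repeats: int = 2) -> str:
    """Accumulate streamed reasoning text and close the think block as soon
    as the last `repeats` candidate answers are identical."""
    trace = "<think>\n"
    for chunk in chunks:
        trace += chunk
        tail = CANDIDATE.findall(trace)[-repeats:]
        if len(tail) == repeats and len(set(tail)) == 1:
            # Same answer seen consecutively: stop reasoning here.
            return trace + "</think>"
    # Stream ended without agreement; leave the trace as generated.
    return trace


if __name__ == "__main__":
    stream = [
        "Try 7 first.\nANSWER: 7\n",
        "Hmm, recheck.\nANSWER: 12\n",
        "Verify again.\nANSWER: 12\n",
        "Yet another pass.\nANSWER: 12\n",
    ]
    print(early_stop(stream))
```

With `repeats=2`, the example stream is cut after the third chunk, so the fourth pass is never consumed. That skipped generation is the kind of token saving the accuracy-versus-token-usage plots in Figures 2 and 3 measure.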

Theorems & Definitions (1)

  • Remark 4.1: Merits of Sequential (State) Verifier