
Adaptive Stopping for Multi-Turn LLM Reasoning

Xiaofan Zhou, Huy Nguyen, Bo Yu, Chenxi Liu, Lu Cheng

Abstract

Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: when should the model stop? Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
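The abstract's core idea, allocating a total error budget across turns so that the pipeline still meets an overall coverage target, can be sketched with a simple union-bound split. The geometric weighting below is a hypothetical allocation rule chosen for illustration (the paper's actual decomposition is given in Theorem 3 and Remark 4); the only property the sketch relies on is that the per-turn budgets sum to the total miscoverage level $\alpha$.

```python
import numpy as np

def allocate_budgets(alpha: float, n_turns: int, decay: float = 0.5) -> np.ndarray:
    """Hypothetical geometric split of a total error budget `alpha`
    across `n_turns` turns. Because the per-turn budgets sum to alpha
    exactly, a union bound over the turns keeps the pipeline's overall
    miscoverage probability at or below alpha."""
    weights = decay ** np.arange(n_turns)      # front-load budget on early turns
    return alpha * weights / weights.sum()

budgets = allocate_budgets(alpha=0.10, n_turns=4)
print(budgets)              # decreasing per-turn budgets
print(budgets.sum())        # 0.1 (total budget preserved)
```

Front-loading the budget on early turns means a question that can be answered in one turn gets a generous allowance to stop immediately, while later turns operate under tighter thresholds.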

Paper Structure

This paper contains 26 sections, 2 theorems, 29 equations, 3 figures, 1 table.

Key Result

Theorem 1

Let $\{(x_i, \mathcal{P}_i^*)\}_{i=1}^{n_{\mathrm{cal}}}$ be the calibration set, where $\mathcal{P}_i^* = \{g \in \mathcal{G}_i \mid \mathcal{T}(g, x_i) \neq \emptyset\}$ is the set of retrievable gold passages for question $x_i$. Let $s^*(g, x_i) = \max_{t \in \mathcal{T}(g, x_i)} s_t(g)$ be the maximum score assigned to $g$ across the turns in which it is retrievable. Then, for a new exchangeable test example, any retrievable gold passage $g \in \mathcal{P}_{n+1}^*$ is retained with probability at least $1-\alpha$.
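The guarantee above follows the standard split-conformal recipe: compute the calibration scores $s^*(g, x_i)$, take a finite-sample-corrected quantile, and retain any test passage scoring at least that threshold. The sketch below illustrates this under simulated scores (the data and the function name are hypothetical; only the quantile construction mirrors the theorem).

```python
import numpy as np

def conformal_threshold(gold_scores, alpha: float) -> float:
    """Split-conformal threshold on retrieval scores.

    `gold_scores` plays the role of the calibration values s*(g, x_i):
    the best score each retrievable gold passage receives across turns.
    Returns tau such that a fresh exchangeable gold score falls at or
    above tau with probability >= 1 - alpha.
    """
    n = len(gold_scores)
    # Use nonconformity r = -s so that larger r means a worse passage.
    r = -np.asarray(gold_scores, dtype=float)
    # Finite-sample correction: the ceil((n+1)(1-alpha))/n empirical quantile.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(r, level, method="higher")
    return -q_hat

rng = np.random.default_rng(0)
cal = rng.normal(loc=1.0, size=1000)          # simulated gold-passage scores
tau = conformal_threshold(cal, alpha=0.1)

test = rng.normal(loc=1.0, size=5000)         # fresh exchangeable scores
print(f"empirical retention: {np.mean(test >= tau):.3f}")
```

On exchangeable data the empirical retention rate concentrates at or slightly above $1-\alpha$, which is exactly the behavior Figure 2 reports for the gold retention rate.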

Figures (3)

  • Figure 1: MiCP pipeline.
  • Figure 2: Empirical gold retention rate vs. the target $1-\alpha$ across five datasets on adaptive RAG and ReAct for retrieval calibration. All models maintain coverage at or above the $1-\alpha$ guarantee for all tested error rates.
  • Figure 3: Empirical coverage rate vs. the target $1-\alpha$ across five datasets on adaptive RAG and ReAct for the final prediction set. All models maintain coverage at or above the $1-\alpha$ guarantee for all tested error rates.

Theorems & Definitions (5)

  • Theorem 1: Retrieval Coverage Guarantee
  • Remark 2
  • Theorem 3: Prediction Set Coverage Guarantee
  • Remark 4: Role of the error budget decomposition
  • Remark 5: Conditional vs. marginal coverage