Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, Difan Zou

TL;DR

RLSTA is a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. It exhibits strong cross-domain generalization and proves effective even without external verifiers, highlighting its potential for general-domain applications.

Abstract

While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause Contextual Inertia: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce Reinforcement Learning with Single-Turn Anchors (RLSTA), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications.

Paper Structure

This paper contains 43 sections, 6 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: Contextual Inertia in multi-turn interaction: The persistence of the initial response leads to an unsatisfactory final answer.
  • Figure 2: Overview of our data preparation and multi-turn simulation pipeline. We partition single-turn prompts into segments to simulate two multi-turn scenarios: MT-Add (incremental information addition) and MT-Refine (correction of initially incorrect conditions).
  • Figure 3: Distributions of Contextual Inertia Intensity $\texttt{I}_{\mathrm{CI}}(m_n, m_{n-1})$. We use GPT-4o to categorize the inertia intensity as Weak (1), Moderate (2), or Strong (3). In most cases, the model's final answer $m_n$ exhibits strong inertia with respect to the preceding response $m_{n-1}$. Notably, the distribution of this intensity is indistinguishable between high-quality and low-quality conversation histories, providing empirical evidence for the indiscriminate nature of contextual inertia.
  • Figure 4: Root cause of failures in multi-turn conversations: failures predominantly originate from previous responses (Misleading Context and Propagated Error), which are driven by Contextual Inertia.
  • Figure 5: Single-Turn Anchor Reward ($R_s$). We leverage the model's superior single-turn ability on the full instruction ($i^{\mathrm{full}}$) as an anchor for the multi-turn final answer $m_n$. Our filtering strategy (Equation \ref{eq:data_filter}) ensures the anchor is reliable by retaining only histories where single-turn performance exceeds multi-turn performance.
  • ...and 7 more figures
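
The filtering rule described in the Figure 5 caption (retain a history only when single-turn performance exceeds multi-turn performance) can be illustrated with a minimal sketch. This is not the paper's implementation: the `History` record, its field names, and the use of a mean over sampled scores as the "performance" measure are all assumptions made for illustration, since the exact form of the filtering equation is not reproduced in this section.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class History:
    """Hypothetical record for one simulated multi-turn history."""
    history_id: str
    # Scores of answers sampled from the full single-turn instruction (assumed in [0, 1]).
    single_turn_scores: List[float]
    # Scores of final answers sampled after the multi-turn history (assumed in [0, 1]).
    multi_turn_scores: List[float]


def keep_history(h: History) -> bool:
    """Keep a history only if mean single-turn score strictly exceeds mean multi-turn score.

    Using the mean as the performance measure is an assumption of this sketch.
    """
    return mean(h.single_turn_scores) > mean(h.multi_turn_scores)


def filter_histories(histories: List[History]) -> List[History]:
    """Return the subset of histories for which the single-turn anchor looks reliable."""
    return [h for h in histories if keep_history(h)]
```

Under these assumptions, a history with single-turn scores `[1, 1, 0, 1]` and multi-turn scores `[0, 0, 1, 0]` would be retained, while one where the two means are equal would be dropped because the comparison is strict.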