Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

Chenxing Wei, Hong Wang, Ying He, Fei Yu, Yao Shu

TL;DR

This work proposes a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with user preferences, then updates a small subset of parameters to steer the model toward this policy, ultimately enabling efficient in-conversation self-correction.

Abstract

Large Language Models (LLMs) employ multi-turn interaction as a fundamental paradigm for completing complex tasks. However, their performance often degrades in extended interactions, as they are typically trained on static, single-turn data, which hinders their ability to adapt to real-time user feedback. To address this limitation, we first propose a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which uses feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with user preferences, then updates a small subset of parameters to steer the model toward this policy, enabling efficient in-conversation self-correction. We then introduce Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM. ROSA guides the model parameters toward a theoretical optimal policy in a single, efficient update step, avoiding costly iterative gradient-based optimization and minimizing computational overhead. We provide a rigorous theoretical analysis guaranteeing that the policy of ROSA converges to the user's preferences as the number of interactions increases. Extensive experiments on challenging benchmarks demonstrate that ROSA achieves significant improvements in both task effectiveness and efficiency.

Paper Structure

This paper contains 52 sections, 4 theorems, 40 equations, 7 figures, 12 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $Z_k(\mathbf{x}) = \sum_{\mathbf{y}' \in \mathcal{Y}} \pi_{\theta_{k-1}}(\mathbf{y}'|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y}')\right)$ be the partition function over the entire response space $\mathcal{Y}$, the policy $\pi^*_{\theta_k}$ that maximizes the turn-wise RLHF ob
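
The theorem statement is cut off above, so the following sketch does not attempt to reproduce it. It only evaluates the partition function $Z_k(\mathbf{x})$ as defined, on a toy discrete response space, and then, as an assumption, uses it to normalize the exponentially reweighted previous-turn policy, which is the standard closed form for KL-regularized reward maximization. The function names, the three-response space, and all numbers are hypothetical.

```python
import math


def partition_function(prior, rewards, beta):
    """Z_k(x) = sum over y' of pi_{theta_{k-1}}(y'|x) * exp(r(x, y') / beta)."""
    return sum(p * math.exp(r / beta) for p, r in zip(prior, rewards))


def reweighted_policy(prior, rewards, beta):
    """Assumed closed form: prior(y|x) * exp(r(x, y) / beta) / Z_k(x)."""
    z = partition_function(prior, rewards, beta)
    return [p * math.exp(r / beta) / z for p, r in zip(prior, rewards)]


# Toy example: three candidate responses for one prompt (hypothetical values).
prior = [0.5, 0.3, 0.2]     # pi_{theta_{k-1}}(y|x), the previous-turn policy
rewards = [-1.0, 0.0, 1.0]  # r(x, y), e.g. derived from user feedback
beta = 1.0                  # regularization strength

target = reweighted_policy(prior, rewards, beta)
print([round(p, 3) for p in target])
```

Under this assumed form, a smaller $\beta$ moves more probability mass toward high-reward responses, while a very large $\beta$ leaves the policy close to $\pi_{\theta_{k-1}}$.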

Figures (7)

  • Figure 1: An illustration of the Test-Time Policy Adaptation for Multi-Turn Interactions (T$^2$PAM) paradigm. Unlike static inference, where the model's policy remains fixed ($\theta_0$, Turn 0), this paradigm treats conversational feedback as an active signal that guides real-time parameter updates (e.g., from $\theta_0$ to $\theta_1$). This iterative process of in-conversation self-correction allows the policy to progressively evolve and align with the user's preferences ($\theta_n$) throughout the interaction.
  • Figure 2: LLM accuracy after 10 rounds of interaction with humans. Although LLM accuracy shows a continuous and gradual improvement, this prompt-based correction process is inefficient.
  • Figure 3: Number of newly solved problems per turn on the MATH dataset.
  • Figure 4: ROSA significantly boosts the rate of accuracy improvement in multi-turn interactions. These charts compare baseline models, the RL variant described in the ablation-strategy appendix, and ROSA on different datasets. In contrast to the slow improvement shown in Figure 2, ROSA not only achieves a higher absolute accuracy but also accelerates the learning process, as evidenced by the steeper slopes of the solid lines. This highlights the efficiency of ROSA in online error correction.
  • Figure 5: Comparison of newly solved problems per round on the MATH dataset.
  • ...and 2 more figures

Theorems & Definitions (10)

  • Theorem 3.1: Closed-Form Optimal Policy
  • Theorem 4.1: Monotonic Error Reduction
  • Theorem 4.2: Cumulative Error Bound
  • Theorem 4.3: Unified Convergence Bound
  • proof
  • Definition 2.1: Single-Sample Feedback Constraint
  • proof: Derivation of the Practical Target from the Theoretical Optimum
  • proof
  • proof: Proof of Theorem 4.2
  • proof