Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems

Raad Khraishi, Iman Zafar, Katie Myles, Greig A Cowan

TL;DR

The paper introduces a switch-matrix benchmark for measuring performance drift when models are switched mid-dialogue. It decomposes switch-induced drift into per-model prefix influence and suffix susceptibility terms, which account for ~70% of variance across benchmarks, and it finds systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix.

Abstract

Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across the CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent, statistically significant, directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and ±4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff risk monitoring, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, accounting for ~70% of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
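
The comparison against the no-switch baseline can be illustrated with a minimal sketch of a paired episode-level bootstrap. The abstract does not specify the bootstrap variant, so the percentile interval used here is an assumption, as are the function name `paired_bootstrap_drift`, its parameters, and the default of 2000 resamples. The inputs are per-episode scores (for example last-turn F1) for one switch cell and for the matching no-switch run of the same suffix model on the same episodes.

```python
import numpy as np

def paired_bootstrap_drift(switch_scores, baseline_scores,
                           n_boot=2000, alpha=0.05, seed=0):
    """Mean drift (switch minus no-switch) with a percentile bootstrap CI.

    Hypothetical helper: both arrays must hold scores for the same
    episodes in the same order, so that differences are paired.
    """
    switch = np.asarray(switch_scores, dtype=float)
    base = np.asarray(baseline_scores, dtype=float)
    if switch.shape != base.shape or switch.ndim != 1:
        raise ValueError("expected two 1-D arrays of equal length")

    # Per-episode paired differences; resampling these keeps the pairing.
    diffs = switch - base
    rng = np.random.default_rng(seed)

    # Resample episodes with replacement, n_boot times.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)

    # Percentile interval (an assumed choice, not confirmed by the source).
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)
```

Under this sketch, a switch cell would be flagged as showing significant drift when the returned interval excludes zero, and the sign of the mean gives the direction of the effect.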

Paper Structure (7 sections, 2 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Observed vs. predicted mean drift under a two-way additive model $\hat{\Delta}_{A\to B}=c+\alpha_A+\beta_B$ on off-diagonal switch cells. CoQA (left): drift in last-turn F1. Multi-IF (right): drift in turn-3 strict success. The model explains most variance across switch cells ($R^2=0.83$ CoQA, $R^2=0.85$ Multi-IF), supporting a factorized view of handoff effects into prefix influence and suffix susceptibility.
  • Figure 2: Estimated prefix influence $\alpha_A$ (sum-to-zero) and suffix susceptibility $\beta_B$ (sum-to-zero) from the additive drift model on CoQA. For $\alpha_A$, positive values indicate prefixes that tend to improve downstream suffix performance relative to suffix no-switch baselines; negative values indicate harmful prefix regimes. For $\beta_B$, positive values indicate suffix models that tend to benefit from foreign prefixes; negative values indicate degradation under context mismatch.
  • Figure 3: Estimated prefix influence $\alpha_A$ (sum-to-zero) and suffix susceptibility $\beta_B$ (sum-to-zero) from the additive drift model on Multi-IF. For $\alpha_A$, positive values indicate prefixes that tend to improve downstream suffix performance relative to suffix no-switch baselines; negative values indicate harmful prefix regimes. For $\beta_B$, positive values indicate suffix models that tend to benefit from foreign prefixes; negative values indicate degradation under context mismatch.
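
The additive model in the captions above, $\hat{\Delta}_{A\to B}=c+\alpha_A+\beta_B$ with sum-to-zero $\alpha$ and $\beta$ fit on off-diagonal switch cells, can be sketched as an ordinary least-squares problem. This is one plausible way to fit it, not necessarily the authors' procedure: the function name `fit_additive_drift`, the effect-coded design matrix, and the use of `numpy.linalg.lstsq` are assumptions. The input is a square matrix whose entry `[a, b]` is the mean drift for prefix model `a` and suffix model `b`, measured relative to the suffix model's no-switch baseline; diagonal entries are ignored. At least three models are needed for the parameters to be identifiable.

```python
import numpy as np

def fit_additive_drift(drift):
    """Fit drift[a, b] ~ c + alpha[a] + beta[b] on off-diagonal cells.

    Hypothetical helper. Sum-to-zero constraints are imposed by effect
    coding: the last model's effect is minus the sum of the others.
    Returns (c, alpha, beta, r2).
    """
    drift = np.asarray(drift, dtype=float)
    n = drift.shape[0]
    if drift.shape != (n, n) or n < 3:
        raise ValueError("expected a square matrix with at least 3 models")

    rows, y = [], []
    for a in range(n):
        for b in range(n):
            if a == b:
                continue  # no-switch cells are excluded from the fit
            x = np.zeros(1 + 2 * (n - 1))
            x[0] = 1.0                      # intercept c
            if a < n - 1:
                x[1 + a] = 1.0              # alpha_a
            else:
                x[1:n] = -1.0               # alpha_last = -sum(others)
            if b < n - 1:
                x[n + b] = 1.0              # beta_b
            else:
                x[n:] = -1.0                # beta_last = -sum(others)
            rows.append(x)
            y.append(drift[a, b])

    X, y = np.array(rows), np.array(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    c = coef[0]
    alpha = np.append(coef[1:n], -coef[1:n].sum())
    beta = np.append(coef[n:], -coef[n:].sum())

    # In-sample R^2 over the off-diagonal cells.
    resid = y - X @ coef
    r2 = 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return float(c), alpha, beta, float(r2)
```

With the sign convention from the captions, a positive `alpha[a]` marks a prefix model that tends to help downstream suffixes, and a negative `beta[b]` marks a suffix model that tends to degrade under foreign prefixes.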