Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, Rema Padman

TL;DR

This work addresses LLM consistency across multi-turn interactions with a dedicated evaluation framework and a practical mitigation. It proposes Position-Weighted Consistency (PWC), a metric that emphasizes early stability and recovery, and MT-Consistency, a diverse benchmark assembled from established QA datasets. The Confidence-Aware Response Generation (CARG) framework uses internal confidence signals to keep responses stable and accurate over follow-ups. Together, these contributions offer actionable paths to improving the reliability of LLMs in high-stakes settings, with empirical evidence of improved stability without sacrificing accuracy.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stakes domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we introduce Position-Weighted Consistency (PWC), a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present MT-Consistency, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that improves response stability by explicitly integrating internal model confidence scores during the generation process. Experimental results demonstrate that CARG significantly improves response stability without sacrificing accuracy, offering a practical path toward more dependable LLM behavior in critical, real-world deployments. Code and data are available at: https://github.com/yubol-bobo/MT-Consistency.

Paper Structure

This paper contains 46 sections, 2 theorems, 13 equations, 8 figures, 9 tables.

Key Result

Proposition 4.1

For any two sequences $\mathbf{s}^h, \mathbf{s}^l$ of the same length $n$, if for some $i\in \{0, 1, \cdots, n-1\}$ we have $s_j^h=s_j^l$ for all $j<i$ and $s^h_i > s^l_i$, then there exists a discount factor $\gamma\in (0, 1/2)$ such that $f^{\gamma}(\mathbf{s}^h)>f^{\gamma}(\mathbf{s}^l)$. (See
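The proposition can be checked numerically with a minimal sketch. This page does not state the definition of $f^{\gamma}$, so the code assumes a discounted sum $f^{\gamma}(\mathbf{s})=\sum_{i=0}^{n-1} s_i\gamma^i$ over binary correctness indicators $s_i\in\{0,1\}$; the names `pwc_score` and `order_preserved` are hypothetical and do not come from the paper's repository. Under that assumption, a lead at position $i$ contributes $\gamma^i$, while everything after it can contribute at most $\gamma^{i+1}/(1-\gamma)$, which is smaller exactly when $\gamma<1/2$.

```python
from itertools import product


def pwc_score(seq, gamma):
    """Discounted sum of per-round indicators.

    ASSUMPTION: f^gamma(s) = sum_i s_i * gamma**i. This form is not given
    on this page and is used here only for illustration.
    """
    return sum(s * gamma ** i for i, s in enumerate(seq))


def order_preserved(n, gamma):
    """Check that score order matches lexicographic order.

    Enumerates all binary sequences of length n. For tuples, `a > b` is
    lexicographic, i.e. a wins at the first position where they differ.
    """
    seqs = list(product((0, 1), repeat=n))
    return all(
        (pwc_score(a, gamma) > pwc_score(b, gamma)) == (a > b)
        for a in seqs
        for b in seqs
        if a != b
    )


if __name__ == "__main__":
    # An early correct answer outweighs all later ones when gamma < 1/2.
    print(order_preserved(5, 0.45))
    # With a large gamma, later rounds can outweigh an early lead.
    print(order_preserved(3, 0.9))
```

With `gamma = 0.9`, the sequence `(0, 1, 1)` scores higher than `(1, 0, 0)`, so the early-stability ordering no longer holds; this is one way to see why the proposition restricts $\gamma$ to $(0, 1/2)$.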

Figures (8)

  • Figure 1: LLMs exhibit inconsistent behavior when deployed in high-stakes domains such as healthcare and education, often adapting their responses, sometimes unpredictably, to user follow-ups, which compromises factual accuracy and reduces reliability.
  • Figure 2: Overview of experimental designs and mitigation strategies. Left: Exp. 1 setup with a single message across multiple rounds. Middle: Exp. 2 setup with 8 different messages across multiple rounds. Right: Proposed Confidence-Aware Response Generation (CARG) method.
  • Figure 3: Initial accuracy of LLMs on benchmark tasks. Commercial models (e.g., Claude) significantly outperform open-source counterparts.
  • Figure 4: Impact of role-play interventions on GPT-4o. Left: Accuracy trends showing GPT-default and GPT-adversarial maintaining similar performance while GPT-friendly underperforms. Right: Confidence dynamics revealing that GPT-default's behavior aligns more closely with the adversarial setting, suggesting an inherent defensive stance.
  • Figure 5: Accuracy trends across follow-up rounds for different LLMs, comparing baseline models with our proposed CARG method.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Proposition 4.1
  • Corollary 4.1