Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

Yubo Li; Yidi Miao; Xueying Ding; Ramayya Krishnan; Rema Padman

TL;DR

This work addresses LLM consistency across multi-turn interactions by introducing a dedicated evaluation framework and a practical mitigation. It proposes Position-Weighted Consistency (PWC), a metric that emphasizes early-turn stability and recovery, and MT-Consistency, a diverse benchmark assembled from established QA datasets. The Confidence-Aware Response Generation (CARG) framework uses internal confidence signals to keep responses stable and accurate over follow-ups. Together, these contributions offer actionable ways to improve LLM reliability in high-stakes settings, with empirical evidence of improved stability without sacrificing accuracy.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stakes domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we introduce Position-Weighted Consistency (PWC), a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present MT-Consistency, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that improves response stability by explicitly integrating internal model confidence scores during the generation process. Experimental results demonstrate that CARG significantly improves response stability without sacrificing accuracy, offering a practical path toward more dependable LLM behavior in critical, real-world deployments. Code and data are available at: https://github.com/yubol-bobo/MT-Consistency.
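The abstract does not spell out how PWC is computed, so the following is only an illustrative sketch of what a position-weighted score could look like, not the paper's definition. It assumes a per-turn correctness sequence and a geometric discount `gamma` restricted to (0, 0.5); the function name, the input format, and the choice of discount are all hypothetical. With `gamma` below 0.5, the weight of any turn exceeds the combined weight of all later turns, so early stability dominates while later recoveries still add to the score.

```python
from typing import Sequence


def position_weighted_consistency(correct: Sequence[bool], gamma: float = 0.45) -> float:
    """Hypothetical position-weighted score (an assumption, not the paper's formula).

    Turn i contributes gamma**i when the model's answer at that turn is correct.
    """
    if not 0.0 < gamma < 0.5:
        # Below 0.5, gamma**i is larger than the sum of all later weights,
        # so holding firm early always outweighs any later pattern.
        raise ValueError("gamma must lie strictly between 0 and 0.5")
    return sum(gamma ** i for i, ok in enumerate(correct) if ok)


# Illustrative sequences: True means the answer stayed correct at that turn.
holds_then_flips = [True, False, False, False]
flips_then_recovers = [False, True, True, True]
recovers_late = [True, False, True]
never_recovers = [True, False, False]

print(position_weighted_consistency(holds_then_flips))
print(position_weighted_consistency(flips_then_recovers))
```

Under these assumptions, a model that holds its answer on the first turn scores higher than one that flips immediately and recovers afterwards, and among sequences with the same early behavior, a later recovery raises the score.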
