Self-Consistency of Large Language Models under Ambiguity

Henning Bartsch; Ole Jorgensen; Domenic Rosati; Jason Hoelscher-Obermaier; Jacob Pfau

TL;DR

The paper investigates whether large language models remain self-consistent when more than one answer can be correct, using a benchmark built on ambiguous integer sequences and an open-source dataset. It evaluates several OpenAI models under greedy decoding, comparing their sequence completions with the explanations they generate, and introduces a nonparametric test to probe alternative outputs. Self-consistency increases with model capability and persists under robustness checks, yet models are poorly calibrated about their own consistency and place non-negligible probability mass on alternative correct answers; the ability to verbalize multiple alternatives varies by model. Overall, self-consistency appears to be an emergent property that was not explicitly trained for, with implications for reliability and safety in under-specified tasks.

Abstract

Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, such as question answering and explanation. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency were random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including prompting speaker changes and sequence length changes. These results suggest that self-consistency arises as an emergent capability without specifically training for it. Despite this, we find that models are uncalibrated when judging their own consistency, with models displaying both over- and under-confidence. We also propose a nonparametric test for determining from the token output distribution whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self-consistent models internally compute multiple possible responses.
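
As a rough illustration of the setup described in the abstract, the sketch below shows what an ambiguous integer sequence and a completion-versus-explanation consistency check might look like. The candidate functions, the helper names (`is_ambiguous`, `is_consistent`), and the reading of consistency as "the explained rule predicts the completion" are assumptions made for this example; they are not taken from the paper's dataset or evaluation code.

```python
from typing import Callable, List

Rule = Callable[[int], int]

# Two hypothetical generating rules (illustrative only, 0-indexed).
def power_of_two(i: int) -> int:
    return 2 ** i

def successor(i: int) -> int:
    return i + 1

def fits(prefix: List[int], rule: Rule) -> bool:
    """True if the rule reproduces every term of the prefix."""
    return [rule(i) for i in range(len(prefix))] == prefix

def is_ambiguous(prefix: List[int], candidates: List[Rule]) -> bool:
    """True if at least two fitting rules disagree on the next term."""
    next_terms = {rule(len(prefix)) for rule in candidates if fits(prefix, rule)}
    return len(next_terms) >= 2

def is_consistent(prefix: List[int], completion: int, explanation: Rule) -> bool:
    """Assumed consistency check: the rule given as an explanation
    predicts the same next term that was given as a completion."""
    return explanation(len(prefix)) == completion

prefix = [1, 2]  # fits both 2**i and i + 1
print(is_ambiguous(prefix, [power_of_two, successor]))
print(is_consistent(prefix, 4, power_of_two))  # completion 4, explanation 2**i
print(is_consistent(prefix, 4, successor))     # completion 4, explanation i + 1
```

Here both rules are correct continuations of the prefix, so neither answer is wrong on its own; the check only asks whether the completion and the explanation agree with each other.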

Paper Structure

This paper contains 20 sections, 1 equation, 7 figures, 5 tables.

Introduction
Dataset: Ambiguous Integer Sequences
Methodology: Evaluating Consistency
Explanation and completion accuracy
Explanation and completion consistency
Consistency and Capability
Robustness Checks for Consistency
Consistency Across Speaker Changes
Consistency Across Base Changes
Distributional Analysis of Model Consistency
Models Do Not Converge to Calculating a Unique Solution
Verbalizing Alternatives
Related Work
Conclusion
Limitations
...and 5 more sections

Figures (7)

Figure 1: Cross-context consistency (orange) and model-judged consistency (blue). Model-judged consistency drops drastically for gpt-4, which underestimates the consistency of the answers it produced itself.
Figure 2: Explanation and sequence completion accuracies plotted against cross-context consistency and model-judged consistency (mean over sequence lengths). This further illustrates gpt-4's inability to correctly assess its own consistency despite being much more consistent.
Figure 3: Cross-context consistency plotted against explanation correctness, varying either the role prompt (left-hand side) or the base representation of the integer sequences being evaluated (middle and right-hand side).
Figure 4: Rate at which correct completion alternatives are assigned non-trivial probability mass, by function class sampled for few-shot exemplars. Across sampling methods, that rate is relatively high, indicating a consistent consideration of correct alternatives across contexts.
Figure 5: Distribution over output probabilities for correct and incorrect completions for the sampling function type random_class. Each histogram is normalized by the data points of the corresponding class label. With KL-divergences of $KL(\text{correct\_and\_pred} || \text{correct\_not\_pred})=1.71$ and $KL(\text{correct\_and\_pred} || \text{incorrect\_not\_pred})=3.45$ bits, the distributions of correct answers have higher overlap.
...and 2 more figures
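
The caption of Figure 5 reports KL divergences in bits between normalized histograms. The following minimal sketch shows how such a quantity can be computed from two histograms over the same bins. The function name `kl_divergence_bits`, the `eps` floor for empty bins, and the bin counts are assumptions made for illustration; they are not the paper's data or code.

```python
import math
from typing import Sequence

def kl_divergence_bits(p_counts: Sequence[float],
                       q_counts: Sequence[float],
                       eps: float = 1e-12) -> float:
    """KL(P || Q) in bits for two histograms defined over the same bins.

    Each histogram is normalized by its own total, mirroring the per-class
    normalization mentioned in the caption. Empty Q bins are floored at
    `eps` (an assumption of this sketch) to avoid division by zero.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p_total, q_total = sum(p_counts), sum(q_counts)
    kl = 0.0
    for p_count, q_count in zip(p_counts, q_counts):
        p = p_count / p_total
        q = max(q_count / q_total, eps)
        if p > 0:  # bins with zero P-mass contribute nothing
            kl += p * math.log2(p / q)
    return kl

# Made-up bin counts, used only to exercise the function.
same = kl_divergence_bits([5, 5], [1, 1])
skewed = kl_divergence_bits([1, 0], [1, 1])
print(same, skewed)
```

A smaller value means the two histograms overlap more, which is the sense in which the caption compares the two reported divergences.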

Theorems & Definitions (1)

Definition A.1: Integer Sequence Ambiguity
