Self-Consistency of Large Language Models under Ambiguity

Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason Hoelscher-Obermaier, Jacob Pfau

TL;DR

The paper investigates whether large language models remain self-consistent when multiple answers can be correct, using a benchmark based on ambiguous integer sequences and an open-source dataset. It analyzes several OpenAI models under greedy decoding, comparing their sequence completions with the explanations they generate, and introduces a nonparametric test to probe alternative outputs. Findings show that self-consistency increases with model capability and persists under robustness tests, yet models exhibit calibration gaps and place non-negligible probability mass on alternative correct answers, and their ability to verbalize multiple alternatives varies by model. Overall, self-consistency appears to be an emergent property that was not explicitly trained for, with implications for reliability and safety in under-specified tasks.

Abstract

Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, e.g., question answering and explanations. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency were random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including changes to the prompted speaker and to the sequence length. These results suggest that self-consistency arises as an emergent capability without models being specifically trained for it. Despite this, we find that models are uncalibrated when judging their own consistency, displaying both over- and under-confidence. We also propose a nonparametric test for determining from the token output distribution whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self-consistent models internally compute multiple possible responses.
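To make the idea of consistency between completions and explanations concrete, here is a minimal sketch, not the authors' code. It assumes each trial records the next element the model predicted directly and an explanation in the form of a function, and it counts a trial as consistent when the explanation implies the same next element. The names `Trial`, `is_consistent`, and `cross_context_consistency` are hypothetical, and the exact scoring used in the paper is not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Trial:
    """One ambiguous-sequence task (hypothetical structure for illustration)."""
    prefix: Sequence[int]              # sequence shown to the model
    completion: int                    # next element the model predicted directly
    explanation: Callable[[int], int]  # function the model offered as explanation


def is_consistent(trial: Trial) -> bool:
    """True if the explanation implies the same next element as the completion."""
    next_index = len(trial.prefix)
    return trial.explanation(next_index) == trial.completion


def cross_context_consistency(trials: Sequence[Trial]) -> float:
    """Fraction of trials in which completion and explanation agree."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(is_consistent(t) for t in trials) / len(trials)


# The prefix 1, 2, 4 fits both 2**i (next: 8) and i*(i+1)//2 + 1 (next: 7).
trials = [
    Trial([1, 2, 4], 8, lambda i: 2 ** i),                # agrees
    Trial([1, 2, 4], 7, lambda i: 2 ** i),                # disagrees
    Trial([1, 2, 4], 7, lambda i: i * (i + 1) // 2 + 1),  # agrees
]
print(cross_context_consistency(trials))
```

Note that both 7 and 8 are correct completions here, so the metric measures agreement between the two contexts rather than accuracy.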

Paper Structure (20 sections, 1 equation, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Cross-context consistency (orange) and model-judged consistency (blue). The latter drops drastically for gpt-4, which underestimates the consistency of the answers it produced itself.
  • Figure 2: Explanation and sequence completion accuracies plotted against cross-context consistency and model-judged consistency (mean over sequence lengths). This further illustrates gpt-4's inability to correctly assess its own consistency despite being much more consistent.
  • Figure 3: Cross-context consistency plotted against explanation correctness, varying either the role prompt (left-hand side) or the base-representation of the integer sequences being evaluated on (middle and right-hand side).
  • Figure 4: Rate at which correct completion alternatives are assigned non-trivial probability mass, by function class sampled for few-shot exemplars. Across sampling methods, that rate is relatively high, indicating a consistent consideration of correct alternatives across contexts.
  • Figure 5: Distribution over output probabilities for correct and incorrect completions for the sampling function type random_class. Each histogram is normalized by the number of data points with the corresponding class label. With KL divergences of $\mathrm{KL}(\text{correct\_and\_pred} \,\|\, \text{correct\_not\_pred}) = 1.71$ and $\mathrm{KL}(\text{correct\_and\_pred} \,\|\, \text{incorrect\_not\_pred}) = 3.45$ bits, the distribution of correct, predicted answers overlaps more with that of correct, non-predicted answers than with that of incorrect answers.
  • ...and 2 more figures
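The Figure 5 caption compares histograms using KL divergence in bits. As a rough sketch of that computation, assuming both histograms share the same bins, the function below normalizes raw counts and evaluates $\mathrm{KL}(p \,\|\, q) = \sum_i p_i \log_2 (p_i / q_i)$. The function name `kl_divergence_bits` and the `eps` floor for empty bins are assumptions made for illustration; the binning and smoothing actually used for the figure are not stated here.

```python
import math
from typing import Sequence


def kl_divergence_bits(p: Sequence[float], q: Sequence[float],
                       eps: float = 1e-12) -> float:
    """KL(p || q) in bits for two histograms over identical bins."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p_total, q_total = sum(p), sum(q)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("histograms must have positive total mass")
    kl = 0.0
    for p_count, q_count in zip(p, q):
        p_i = p_count / p_total
        # Floor q_i so empty bins in q do not cause division by zero
        # (an assumed smoothing choice, not taken from the paper).
        q_i = max(q_count / q_total, eps)
        if p_i > 0:  # bins with p_i == 0 contribute nothing
            kl += p_i * math.log2(p_i / q_i)
    return kl


# All mass of p in one bin versus a uniform q over two bins.
print(kl_divergence_bits([4, 0], [3, 3]))
```

A smaller value means the first histogram is better described by the second, which is the sense in which the caption speaks of higher overlap.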

Theorems & Definitions (1)

  • Definition A.1: Integer Sequence Ambiguity
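The text of Definition A.1 is not reproduced on this page. Going only by the abstract's description of under-specification ("two or more answers can be correct"), one plausible reading is that a sequence prefix is ambiguous when at least two candidate functions generate it yet continue it differently. The sketch below checks that condition; the helper names and the small candidate set are hypothetical and are not the function classes used in the paper.

```python
from typing import Callable, Dict, Sequence


def fitting_continuations(prefix: Sequence[int],
                          candidates: Dict[str, Callable[[int], int]]) -> Dict[str, int]:
    """Next element proposed by each candidate function that reproduces the prefix."""
    n = len(prefix)
    return {
        name: f(n)
        for name, f in candidates.items()
        if all(f(i) == x for i, x in enumerate(prefix))
    }


def is_ambiguous(prefix: Sequence[int],
                 candidates: Dict[str, Callable[[int], int]]) -> bool:
    """Assumed reading: two or more fitting functions disagree on the next element."""
    return len(set(fitting_continuations(prefix, candidates).values())) >= 2


# Illustrative candidate functions (not the paper's function classes).
candidates = {
    "power_of_two": lambda i: 2 ** i,
    "triangular_plus_one": lambda i: i * (i + 1) // 2 + 1,
    "multiples_of_three": lambda i: 3 * i,
}

print(fitting_continuations([1, 2, 4], candidates))
print(is_ambiguous([1, 2, 4], candidates))
print(is_ambiguous([1, 2, 4, 8], candidates))
```

In this toy setup, lengthening the prefix by one element removes the ambiguity, which hints at why sequence length is a natural robustness variable.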