Diversity Measures: Domain-Independent Proxies for Failure in Language Model Queries

Noel Ngu, Nathaniel Lee, Paulo Shakarian

TL;DR

The paper introduces domain-independent proxies for predicting failure in language model queries by quantifying response diversity. It defines three metrics—entropy $H$, Gini impurity $G$, and centroid distance $CD$—to capture diversity of outputs for a given prompt. Across multiple datasets and temperature settings, these proxies show strong correlation with the likelihood of error and are demonstrated to support few-shot prompting, chain-of-thought reasoning, and error detection. This approach provides a practical, domain-agnostic tool for assessing LM reliability and guiding prompting strategies without relying on task-specific signals.

Abstract

Error prediction in large language models often relies on domain-specific information. In this paper, we present measures for quantification of error in the response of a large language model based on the diversity of responses to a given prompt - hence independent of the underlying application. We describe how three such measures - based on entropy, Gini impurity, and centroid distance - can be employed. We perform a suite of experiments on multiple datasets and temperature settings to demonstrate that these measures strongly correlate with the probability of failure. Additionally, we present empirical results demonstrating how these measures can be applied to few-shot prompting, chain-of-thought reasoning, and error detection.
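
As a rough illustration of the three measures named above, the sketch below computes them for a small batch of sampled responses. The abstract does not give the paper's exact formulas, so everything here is an assumption: entropy and Gini impurity are taken over the empirical distribution of distinct answers, and centroid distance is taken as the mean Euclidean distance of response embedding vectors from their centroid. The function names are hypothetical and are not the authors' code.

```python
import math
from collections import Counter


def answer_distribution(answers):
    """Empirical probability of each distinct answer among the samples."""
    counts = Counter(answers)
    total = len(answers)
    return [c / total for c in counts.values()]


def entropy(answers):
    """Shannon entropy in bits (assumed form; 0 when all answers agree)."""
    return sum(p * math.log2(1 / p) for p in answer_distribution(answers))


def gini_impurity(answers):
    """Gini impurity (assumed form): chance two random samples disagree."""
    return 1.0 - sum(p * p for p in answer_distribution(answers))


def centroid_distance(vectors):
    """Mean Euclidean distance of embedding vectors from their centroid.

    Assumption: each response has already been mapped to a numeric
    vector by some embedding model, which is outside this sketch.
    """
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(math.dist(v, centroid) for v in vectors) / n


# Four sampled answers to one prompt, three of which agree.
samples = ["4", "4", "4", "5"]
print(entropy(samples))
print(gini_impurity(samples))
```

Under these assumptions, all three values are zero when every sampled response is identical and grow as the responses spread out, which is the property that lets them act as a proxy for failure.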

Paper Structure (64 sections, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: Using the trim and clip commands produces fragile layers that can result in disasters (like this one from an actual paper) when the color space is corrected or the PDF combined with others for the final proceedings. Crop your figures properly in a graphics program -- not in LaTeX.
  • Figure 2: Adjusting the bounding box instead of actually removing the unwanted data resulted in multiple layers in this paper. It also needlessly increased the PDF size. In this case, the size of the unwanted layer doubled the paper's size, and produced the following surprising results in final production. Crop your figures properly in a graphics program. Don't just alter the bounding box.
  • Figure 3: Example listing quicksort.hs