Increasing LLM response trustworthiness using voting ensembles

Aparna Nair-Kanneganti, Trevor J. Chan, Shir Goldfinger, Emily Mackay, Brian Anthony, Alison Pouch

TL;DR

This work presents a formal voting-ensemble framework for LLMs to address model hallucinations by allowing abstention when a consensus is not confidently reached. The authors derive theoretical results showing that, for large ensembles, optimality favors a permissive threshold ($k_{opt}=1$) and that trust can be dramatically increased with restrictive voting, depending on the question's deceptiveness $\delta$ and bewilderment $\eta$. Empirical results in arithmetic problem solving and clinical-note extraction confirm that restricting voting raises trust and reduces hallucinations with only modest declines in yield and accuracy. The approach offers a practical uncertainty-management tool for high-stakes applications like healthcare and data annotation, providing a tunable tradeoff between automation confidence and coverage.

Abstract

Despite huge advances, LLMs still lack convenient and reliable methods to quantify the uncertainty in their responses, making them difficult to trust in high-stakes applications. One of the simplest approaches to eliciting more accurate answers is to select the mode of many responses, a technique known as ensembling. In this work, we expand on typical ensembling approaches by looking at ensembles with a variable voting threshold. We introduce a theoretical framework for question answering and show that, by permitting ensembles to "abstain" from providing an answer when the dominant response falls short of the threshold, it is possible to dramatically increase the trustworthiness of the remaining answers. From this framework, we derive theoretical results and report experimental results on two problem domains: arithmetic problem solving and clinical-note question-answering. In both domains, we observe that large gains in answer trustworthiness can be achieved using highly restrictive voting ensembles, while incurring relatively modest reductions in response yield and accuracy. Because of this, voting ensembles may be particularly useful in applications, such as healthcare and data annotation, that require a high degree of certainty but do not require that every question receive an automated answer.
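
The thresholded vote described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `vote_with_threshold`, the use of `None` to signal abstention, and the rule that a tie for first place counts as no consensus are all assumptions made here.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote_with_threshold(responses: Sequence[Hashable], k: int) -> Optional[Hashable]:
    """Return the dominant response if it received at least k votes, else None.

    `responses` holds one answer per ensemble member; `k` is the voting
    threshold. Returning None means the ensemble abstains.
    """
    if not responses:
        return None  # nothing to vote on
    ranked = Counter(responses).most_common(2)
    top_answer, top_votes = ranked[0]
    # Assumption: a tie for first place is treated as no consensus.
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None
    return top_answer if top_votes >= k else None


# Five hypothetical model responses to the same question.
responses = ["12", "12", "13", "12", "7"]
print(vote_with_threshold(responses, k=3))  # dominant answer has 3 votes
print(vote_with_threshold(responses, k=4))  # threshold not met: abstain
```

Setting `k=1` gives the permissive plurality vote. Raising `k` toward the ensemble size makes the ensemble abstain more often, so it gives up coverage in return for higher confidence in the answers it does give.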

Paper Structure

This paper contains 16 sections, 7 theorems, 14 equations, 9 figures, 1 table.

Key Result

Theorem 1

Accuracy is monotonically non-increasing in the voting threshold $k$: $P_C(k+1) \leq P_C(k)$ for all $k \in \{1, \dots, n-1\}$.
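
The theorem can be checked on toy data with a short threshold sweep. The sketch below rests on assumed metric definitions, which this page does not spell out: accuracy is the fraction of all questions that receive a correct answer, yield is the fraction that receive any answer, and trust is the fraction of answered questions that are correct. The helper names and the tie-means-abstain rule are also assumptions.

```python
from collections import Counter


def ensemble_answer(responses, k):
    """Dominant response if it has >= k votes and no tie for first, else None."""
    ranked = Counter(responses).most_common(2)
    top, votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == votes
    return top if votes >= k and not tied else None


def ensemble_metrics(questions, k):
    """Return (accuracy, yield, trust) at threshold k.

    `questions` is a list of (responses, correct_answer) pairs.
    Assumed definitions: accuracy = correct / all questions,
    yield = answered / all questions, trust = correct / answered.
    """
    outcomes = [(ensemble_answer(r, k), truth) for r, truth in questions]
    answered = [(a, t) for a, t in outcomes if a is not None]
    correct = sum(a == t for a, t in answered)
    n = len(questions)
    trust = correct / len(answered) if answered else float("nan")
    return correct / n, len(answered) / n, trust


# Hypothetical votes from an ensemble of n = 5 models on four questions.
QUESTIONS = [
    (["6", "6", "6", "6", "6"], "6"),  # unanimous and correct
    (["8", "8", "8", "9", "7"], "8"),  # weak majority, correct
    (["5", "5", "5", "5", "4"], "4"),  # strong majority on a wrong answer
    (["1", "2", "3", "3", "4"], "3"),  # scattered votes, plurality correct
]

for k in range(1, 6):
    acc, yld, trust = ensemble_metrics(QUESTIONS, k)
    print(f"k={k}: accuracy={acc:.2f} yield={yld:.2f} trust={trust:.2f}")
```

On this toy data accuracy never rises as `k` grows. This follows from the setup: any question answered correctly at threshold `k+1` is also answered correctly at threshold `k`. Trust, by contrast, is not constrained to move in one direction.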

Figures (9)

  • Figure 1: Question difficulty is a function of deceptiveness ($\delta$), a question's tendency to elicit a single specious answer, and bewilderment ($\eta$), the degree to which it encourages random guessing. (Illustration generated by ChatGPT.)
  • Figure 2: Simulated voting scenarios show that, for a single question, deceptiveness $\delta$ alone dictates the ensemble accuracy as ensemble size approaches infinity. However, the rate of convergence is governed by the bewilderment $\eta$.
  • Figure 3: (a) An ensemble of 50 models answers multiplication questions of varying difficulty. Notably, trust improves at high voting thresholds. (b) Similar behavior is observed when the ensemble evaluates arithmetic expressions with multiple operations.
  • Figure 4: An ensemble answers arithmetic questions using chain-of-thought prompting. Performance and question distributions are shown for questions involving (a) multiplication and (b) expressions with multiple operations.
  • Figure 5: A single model and an ensemble extract salient features from the text of echocardiogram reports. Accuracy, yield, and trust are shown as a function of voting threshold, together with question distributions, for model-extracted (a) left ventricular ejection fraction (LVEF), (b) mitral stenosis (MS), and (c) mitral regurgitation (MR).
  • ...and 4 more figures

Theorems & Definitions (11)

  • Theorem 1: Accuracy maximization under permissive voting
  • Theorem 2: Yield maximization under permissive voting
  • Theorem 3: Maximal accuracy in the large-ensemble limit
  • Theorem 1: Accuracy maximization under permissive voting
  • proof
  • Theorem 2: Yield maximization under permissive voting
  • proof
  • Lemma 1: No no-consensus in the large-ensemble limit
  • proof
  • Theorem 3: Maximal accuracy in the large-ensemble limit
  • ...and 1 more