Robust Knowledge Extraction from Large Language Models using Social Choice Theory

Nico Potyka, Yuqicheng Zhu, Yunjie He, Evgeny Kharlamov, Steffen Staab

TL;DR

This paper addresses the limited robustness of large language models (LLMs) in high-stakes query answering by issuing ranking queries repeatedly and aggregating the returned rankings with a method from social choice theory, Partial Borda Weighting (PBW). It formalizes a transformation framework: a transformation $T(Q,N,t)$ generates a profile $p = (\succeq_1, \dots, \succeq_N)$ of $N$ rankings from an input query $Q$, and the aggregation $A_{PBW}$ merges the profile into a single robust ranking. Each ranking assigns an outcome $o$ the weight $w^{PBW}_{\succeq}(o) = 2 \cdot \mathrm{Down}_{\succeq}(o) + \mathrm{Inc}_{\succeq}(o)$, where $\mathrm{Down}_{\succeq}(o)$ counts the outcomes ranked strictly below $o$ and $\mathrm{Inc}_{\succeq}(o)$ counts the outcomes incomparable to $o$. The weights are summed over the profile, $s^{PBW}_{p}(o) = \sum_{i=1}^N w^{PBW}_{\succeq_i}(o)$, normalized as $\overline{s}^{PBW}_{p}(o) = s^{PBW}_{p}(o) / \sum_{o'} s^{PBW}_{p}(o')$, and the selected outcome is $f^{PBW}(p) = \arg \max_{o} s^{PBW}_{p}(o)$. The approach is evaluated on manufacturing, finance, and medical ranking tasks, showing that PBW-based aggregation improves robustness to both query and syntax uncertainty relative to baselines that either do not aggregate or use simple averaging. The results show that aggregating even a small number of responses can substantially improve rank stability, suggesting practical utility for domain-specific, high-accuracy LLM applications. The work offers a principled, interpretable approach to uncertainty quantification for LLMs that relies on established social-choice mechanisms and requires neither fine-tuning nor access to proprietary model internals.
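The PBW formulas above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each LLM answer is a strict top-k list (best outcome first) and that any outcome missing from the list is ranked below all listed outcomes and is incomparable to the other missing ones; the function names `pbw_weights` and `pbw_aggregate` and the toy diagnostic data are hypothetical.

```python
def pbw_weights(ranking, outcomes):
    """PBW weight 2*Down(o) + Inc(o) for one (possibly partial) ranking.

    Assumption: `ranking` lists distinct outcomes, best first; outcomes
    not listed are below every listed one and mutually incomparable.
    """
    ranked = list(ranking)
    unranked = [o for o in outcomes if o not in ranked]
    weights = {}
    for pos, o in enumerate(ranked):
        # Strictly below o: later ranked outcomes plus all unranked ones.
        down = (len(ranked) - pos - 1) + len(unranked)
        weights[o] = 2 * down  # Inc(o) = 0 for a ranked outcome
    for o in unranked:
        # Down(o) = 0; o is incomparable to the other unranked outcomes.
        weights[o] = len(unranked) - 1
    return weights


def pbw_aggregate(profile, outcomes):
    """Sum PBW weights over a profile, normalize, and pick the arg max."""
    scores = {o: 0 for o in outcomes}
    for ranking in profile:
        for o, w in pbw_weights(ranking, outcomes).items():
            scores[o] += w
    total = sum(scores.values())
    normalized = {o: (s / total if total else 0.0) for o, s in scores.items()}
    best = max(scores.values())
    winners = [o for o in outcomes if scores[o] == best]  # ties are kept
    return scores, normalized, winners


# Toy example: three repeated answers to the same diagnostic ranking query.
outcomes = ["flu", "cold", "covid", "allergy"]
profile = [["flu", "cold"], ["flu", "covid"], ["cold", "flu"]]
scores, normalized, winners = pbw_aggregate(profile, outcomes)
print(scores)   # {'flu': 16, 'cold': 11, 'covid': 6, 'allergy': 3}
print(winners)  # ['flu']
```

In this sketch every ranking distributes the same total weight, n(n-1) for n outcomes, so the normalized scores sum to one and can be compared across profiles of different sizes.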

Abstract

Large language models (LLMs) can support a wide range of applications like conversational agents, creative writing or general query answering. However, they are ill-suited for query answering in high-stakes domains like medicine because they are typically not robust: even the same query can result in different answers when prompted multiple times. In order to improve the robustness of LLM queries, we propose issuing ranking queries repeatedly and aggregating the answers using methods from social choice theory. We study ranking queries in diagnostic settings like medical and fault diagnosis and discuss how the Partial Borda Choice function from the literature can be applied to merge multiple query results. We discuss some additional interesting properties in our setting and evaluate the robustness of our approach empirically.

Paper Structure (26 sections, 3 theorems, 13 equations, 9 figures, 6 tables, 2 algorithms)

Key Result

theorem 1

Figures (9)

  • Figure 1: Query templates for evaluating query uncertainty.
  • Figure 2: Syntactic variants of the manufacturing query.
  • Figure 3: Robustness with respect to the number of answers used for aggregation.
  • ...and 6 more figures

Theorems & Definitions (5)

  • definition 1: PBW Weighting
  • theorem 1: cullinan2014borda
  • theorem 2: cullinan2014borda
  • definition 2
  • proposition 1