Confidence Diagram of Nonparametric Ranking for Uncertainty Assessment in Large Language Models Evaluation

Zebin Wang, Yi Han, Ethan X. Fang, Lan Wang, Junwei Lu

TL;DR

This work addresses the challenge of inferring context-dependent rankings among $n$ large language models under a nonparametric scoring regime. It introduces a kernel-smoothed contextual ranking framework, a confidence diagram that visualizes global ranking uncertainty via a Hasse diagram, and a Gaussian multiplier bootstrap that handles suprema over a continuous prompt domain with non-identically distributed scores. Theoretical guarantees include a convergence rate for the regularized kernel MLE and the validity of the confidence bands, hypothesis tests, and confidence diagram. Empirical validation on synthetic data and real medical-domain data demonstrates accurate estimation, reliable uncertainty quantification, and informative ranking structures that support domain-specific model selection and alignment efforts.
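To make the bootstrap step more concrete, the following is a minimal sketch of a generic Gaussian multiplier bootstrap for the supremum of a centered process. It is not the paper's procedure: the continuous prompt domain is replaced by a finite grid, the statistic is a plain maximum of centered sums, and the function name `multiplier_bootstrap_quantile`, the grid size, and the simulated heteroscedastic data are all assumptions made for illustration.

```python
import math
import random


def multiplier_bootstrap_quantile(rows, alpha=0.05, n_boot=500, seed=0):
    """Approximate the (1 - alpha) quantile of max_j |n^{-1/2} sum_i (X_ij - mean_j)|.

    rows: list of n independent observations, each a list of d values
          (one value per grid point; the grid stands in for a continuous domain).
    """
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    # Center each grid coordinate at its sample mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    stats = []
    for _ in range(n_boot):
        # One standard Gaussian multiplier per observation.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sup = max(
            abs(sum(e[i] * centered[i][j] for i in range(n))) / math.sqrt(n)
            for j in range(d)
        )
        stats.append(sup)

    stats.sort()
    idx = min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)
    return stats[idx]


# Hypothetical data: independent rows whose noise scale differs across rows,
# mimicking observations that are independent but not identically distributed.
data_rng = random.Random(1)
n_obs, grid = 200, 10
rows = [
    [data_rng.gauss(0.0, 0.5 + (i % 4) * 0.25) for _ in range(grid)]
    for i in range(n_obs)
]
q95 = multiplier_bootstrap_quantile(rows, alpha=0.05)
print("bootstrap 95% critical value:", q95)
```

The returned value plays the role of a critical value: a band of half-width proportional to it around an estimate would be simultaneous over the grid, under the assumptions of this sketch.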

Abstract

We consider inference for the ranking of large language models (LLMs). Alignment is a significant challenge in mitigating hallucinations in the use of LLMs, and ranking LLMs has proven to be an effective tool for improving alignment based on the best-of-$N$ policy. In this paper, we propose a new inferential framework for hypothesis testing on the rankings of language models. Our approach builds on a nonparametric contextual ranking framework designed to assess large language models' domain-specific expertise, leveraging nonparametric scoring methods to account for their sensitivity to the prompts. To characterize the combinatorial complexity of the ranking, we introduce the novel concept of a confidence diagram, which leverages a Hasse diagram to represent the entire confidence set of rankings by a single directed graph. We show the validity of the proposed confidence diagram by advancing the Gaussian multiplier bootstrap theory to accommodate the supremum of independent empirical processes that are not necessarily identically distributed. Extensive numerical experiments conducted on both synthetic and real data demonstrate that our approach offers valuable insight into the evaluation of the performance of different LLMs across various medical domains.
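As an illustration of how a set of confident pairwise comparisons can be summarized by a single directed graph, the sketch below computes the Hasse diagram (the transitive reduction) of a strict partial order and the range of ranks each model can take. It is a simplified stand-in, not the paper's construction: the comparisons here come from hypothetical, non-overlapping per-model intervals rather than from the paper's test statistics, and the names `transitive_closure`, `hasse_edges`, and `rank_ranges` are illustrative.

```python
def transitive_closure(n, pairs):
    """Close a set of (i, j) pairs, read as 'model i ranks below model j'."""
    closure = set(pairs)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if (i, k) in closure and (k, j) in closure:
                    closure.add((i, j))
    if any((i, i) in closure for i in range(n)):
        raise ValueError("comparisons contain a cycle; not a strict partial order")
    return closure


def hasse_edges(n, pairs):
    """Keep only covering relations: i < j with no k such that i < k < j."""
    closure = transitive_closure(n, pairs)
    return sorted(
        (i, j)
        for (i, j) in closure
        if not any((i, k) in closure and (k, j) in closure for k in range(n))
    )


def rank_ranges(n, pairs):
    """Best and worst rank (1 = best) of each model over all compatible rankings."""
    closure = transitive_closure(n, pairs)
    ranges = {}
    for i in range(n):
        above = sum((i, j) in closure for j in range(n))  # models surely better
        below = sum((j, i) in closure for j in range(n))  # models surely worse
        ranges[i] = (1 + above, n - below)
    return ranges


# Hypothetical simultaneous score intervals for four models (an assumption).
intervals = [(0.0, 0.2), (0.3, 0.5), (0.4, 0.9), (0.95, 1.2)]
n = len(intervals)
# Declare i < j only when the two intervals are separated.
pairs = {
    (i, j)
    for i in range(n)
    for j in range(n)
    if intervals[i][1] < intervals[j][0]
}
edges = hasse_edges(n, pairs)
ranges = rank_ranges(n, pairs)
print(edges)   # [(0, 1), (0, 2), (1, 3), (2, 3)]
print(ranges)  # {0: (4, 4), 1: (2, 3), 2: (2, 3), 3: (1, 1)}
```

In this toy example, models 1 and 2 cannot be separated, so the diagram leaves them unordered and each may hold rank 2 or 3, while the edge from 0 to 3 is dropped because it is implied by the paths through 1 and 2.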

Paper Structure

This paper contains 43 sections, 42 theorems, 370 equations, 8 figures, 3 algorithms.

Key Result

Theorem 4.3

Suppose that the assumptions in Section sec:assumptions hold. There exists a positive constant $C$ such that the regularized MLE $\widehat{\bm{\theta}}(\mathbf{x})$ satisfies that

Figures (8)

  • Figure 1: Nodes 1 through 8 represent eight models, with directed edges indicating the strict partial order "$<$" based on the pairwise comparison.
  • Figure 2: The performance of the proposed estimators, comparing the MSE with different $n$, $L$ and $p$. (A) We set $p=0.5$ and vary $n$ and $L$. (B) We set $n=20$ and vary $L$ and $p$.
  • Figure 3: The coverage of the confidence band over the ground truth with different $n$, $L$, $p$ and the model index $i$. We fix $x_3 = 0.4$ and vary $x_1$ and $x_2$ uniformly over $[0,1]$.
  • Figure 4: Example of the constructed Hasse diagram with $n=20$, $L=100$ and $p=0.2$.
  • Figure 5: Heatmap for the possible ranks indicated by the confidence diagram, comparing the frequency of the possible ranks with the true ranks. (A) We fix $n=20$, $L=50$ and $p=0.2$. (B) We fix $n=20$, $L=100$ and $p=0.2$.
  • ...and 3 more figures

Theorems & Definitions (49)

  • Example 1.1: Pairwise Inference
  • Example 1.2: Top-$K$ Inference
  • Example 1.3: Confidence Diagram
  • Remark 4.1
  • Remark 4.2
  • Theorem 4.3
  • Remark 4.4
  • Theorem 4.5
  • Corollary 4.6
  • Theorem 4.7
  • ...and 39 more