HIP-LLM: A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language Models

Robab Aghazadeh-Chakherlou, Qing Guo, Siddartha Khastgir, Peter Popov, Xiaoge Zhang, Xingyu Zhao

TL;DR

Experiments on multiple benchmark datasets demonstrate that HIP-LLM offers a more accurate and standardized reliability characterization than existing benchmark and state-of-the-art approaches.

Abstract

Large Language Models (LLMs) are increasingly deployed across diverse domains, raising the need for rigorous reliability assessment methods. Existing benchmark-based evaluations primarily offer descriptive statistics of model accuracy over datasets, providing limited insight into the probabilistic behavior of LLMs under real operational conditions. This paper introduces HIP-LLM, a Hierarchical Imprecise Probability framework for modeling and inferring LLM reliability. Building upon the foundations of software reliability engineering, HIP-LLM defines LLM reliability as the probability of failure-free operation over a specified number of future tasks under a given Operational Profile (OP). HIP-LLM represents dependencies across (sub-)domains hierarchically, enabling multi-level inference from subdomain to system-level reliability. HIP-LLM embeds imprecise priors to capture epistemic uncertainty and incorporates OPs to reflect usage contexts. It derives posterior reliability envelopes that quantify uncertainty across priors and data. Experiments on multiple benchmark datasets demonstrate that HIP-LLM offers a more accurate and standardized reliability characterization than existing benchmark and state-of-the-art approaches. A publicly accessible repository of HIP-LLM is provided.
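To make the idea of a posterior reliability envelope concrete, the following is a minimal sketch under simplifying assumptions, not the paper's hierarchical model. It assumes a single subdomain with a plain Beta-Binomial model, independent future tasks, and an imprecise prior given as a small grid of Beta hyperparameters. The function names, the grid values, and the observed counts are all hypothetical. For each prior in the set it computes the posterior expectation of $\theta^{n_F}$ (the probability that the next $n_F$ tasks are all failure-free) and reports the smallest and largest values.

```python
from itertools import product


def posterior_reliability(a, b, successes, trials, n_future):
    """Posterior E[theta^n_future] for a Beta(a, b) prior and binomial data.

    Assumption: tasks are independent Bernoulli trials with a common
    non-failure probability theta (a simplification, not the paper's model).
    """
    alpha = a + successes              # posterior Beta first parameter
    beta = b + (trials - successes)    # posterior Beta second parameter
    value = 1.0
    # Raw moment of a Beta(alpha, beta): prod_{r<k} (alpha + r) / (alpha + beta + r)
    for r in range(n_future):
        value *= (alpha + r) / (alpha + beta + r)
    return value


def reliability_envelope(a_values, b_values, successes, trials, n_future):
    """Lower and upper posterior reliability over a grid of Beta priors."""
    values = [
        posterior_reliability(a, b, successes, trials, n_future)
        for a, b in product(a_values, b_values)
    ]
    return min(values), max(values)


# Hypothetical prior set and hypothetical benchmark counts, for illustration only.
a_grid = [1.0, 2.0, 5.0]
b_grid = [1.0, 2.0, 5.0]
lower, upper = reliability_envelope(a_grid, b_grid, successes=90, trials=100, n_future=10)
print(f"reliability over 10 future tasks lies in [{lower:.4f}, {upper:.4f}]")
```

In this simplified setting the moment is increasing in the first Beta parameter and decreasing in the second, so the bounds over a rectangular prior set are attained at its corners. The gap between the bounds narrows as more data are observed, which is the behavior an envelope is meant to expose.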

Paper Structure

This paper contains 44 sections, 7 theorems, 95 equations, 8 figures, 2 tables.

Key Result

Theorem 1

For subdomain $S_{ij}$ in domain $D_i$, let $C_i=\{(C_{ik},N_{ik})\}_{k=1}^{n_i}$ be the observed data. Let the admissible set of hyperparameters be and write $h_i=(a_i,b_i,c_i,d_i)$. Then, for any $h_i \in \mathcal{A}_i$, the marginal posterior density of $\theta_{ij}$ is where $f_{\mathrm{marg}}$ (unnormalized posterior) and $Z_{\mathrm{marg}}$ (normalizing constant) are with $L(\boldsymbol{\t

Figures (8)

  • Figure 1: Schematic representation of the hierarchical LLM, domain, and subdomain structure for reliability estimation for $M$ LLMs.
  • Figure 2: Hierarchical structure with independent domains and dependent subdomains. For readability, parameters and priors are shown only for one subdomain $S_{ij}$ and its parent domain $D_i$ under a representative $LLM_k$; the remaining subdomains, domains ($i=1, \dots, m$), and LLMs $(k=1, \dots ,M)$ are identical, differing only in their indices.
  • Figure 3: Posterior CDF envelopes at subdomain level. Rows correspond to domains (D$_1$ (coding): MBPP, DS-1000; D$_2$ (reasoning): BoolQ, RACE-H).
  • Figure 4: Posterior CDF envelopes for domain-level reliabilities $p_i=\sum_j\Omega_{ij}\theta_{ij}$. Operational profile weights are $\Omega_{1\cdot}=(0.204,0.796)$ for D$_1$ and $\Omega_{2\cdot}=(0.483,0.517)$ for D$_2$.
  • Figure 5: Posterior CDF envelope for overall LLM reliability $p_L=\sum_i W_i p_i$ with domain weights $W=(0.149,\,0.851)$.
  • ...and 3 more figures
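
The two aggregation formulas in the captions of Figures 4 and 5, $p_i=\sum_j\Omega_{ij}\theta_{ij}$ and $p_L=\sum_i W_i p_i$, can be illustrated with a short sketch. The weights below are the ones quoted in those captions. The subdomain values in `theta` are hypothetical point values chosen only for illustration; they are not results from the paper, which works with posterior distributions and envelopes rather than single numbers. The helper name `weighted_reliability` is likewise an assumption.

```python
def weighted_reliability(weights, values):
    """Operational-profile-weighted sum of non-failure probabilities."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("operational profile weights must sum to 1")
    return sum(w * v for w, v in zip(weights, values))


# Weights taken from the captions of Figures 4 and 5.
OMEGA = [(0.204, 0.796),   # D1 (coding): MBPP, DS-1000
         (0.483, 0.517)]   # D2 (reasoning): BoolQ, RACE-H
W = (0.149, 0.851)         # domain weights for D1, D2

# Hypothetical subdomain non-failure probabilities theta_ij (illustration only).
theta = [(0.70, 0.40),
         (0.85, 0.80)]

# Domain level: p_i = sum_j Omega_ij * theta_ij
p = [weighted_reliability(omega_i, theta_i) for omega_i, theta_i in zip(OMEGA, theta)]

# LLM level: p_L = sum_i W_i * p_i
p_L = weighted_reliability(W, p)

print("domain-level reliabilities:", p)
print("overall LLM reliability:", p_L)
```

Because the domain weights put most of the mass on D$_2$, the overall value in this sketch is dominated by the reasoning subdomains, which shows how strongly the operational profile shapes the system-level figure.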

Theorems & Definitions (15)

  • Definition 1: Software reliability
  • Definition 2: Operational Profile (OP)
  • Definition 3: LLM reliability
  • Definition 4: Formalised LLM reliability
  • Theorem 1: Sub-domain level non-failure probability
  • Theorem 2: Domain level posterior non-failure probability
  • Theorem 3: LLM-level posterior non-failure probability
  • Theorem 4: Subdomain posterior reliability for $n_F$ future operations
  • Theorem 5: Domain posterior reliability for $n_F$ future operations
  • Theorem 6: LLM posterior reliability for $n_F$ future operations
  • ...and 5 more