Conformal Tail Risk Control for Large Language Model Alignment

Catherine Yu-Chi Chen, Jingyan Shen, Zhun Deng, Lihua Lei

TL;DR

The paper tackles tail risk in large language model outputs by aligning human disutility with machine scores through distortion risk measures. It introduces a lightweight calibration framework that treats the human risk as a black-box target and uses a univariate threshold $\hat{\lambda}$, selected via a conformal upper confidence bound, to control $R_\psi(F_{r_{\hat{\lambda}}})$ without retraining the model. The core contributions are PAC-style guarantees for distortion-risk control via L-statistics, asymptotic normality results with consistent variance estimators, and practical deployment strategies with finite-sample confidence. Empirical results on toxicity tasks show that the proposed method (CDRC-L) achieves risk control with less conservatism and lower deployment cost than the DKW- and Berk-Jones-based baselines, particularly as human-machine alignment improves, underscoring its practical value for safe LLM deployment.

Abstract

Recent developments in large language models (LLMs) have led to their widespread usage for various tasks. The prevalence of LLMs in society calls for assurance of the reliability of their performance. In particular, risk-sensitive applications demand meticulous attention to unexpectedly poor outcomes, i.e., tail events such as toxic answers, humiliating language, and offensive outputs. Because human annotations are costly to acquire, general-purpose scoring models have been created to automate the process of quantifying these tail events. This introduces potential human-machine misalignment between the respective scoring mechanisms. In this work, we present a lightweight calibration framework for black-box models that ensures the alignment of humans and machines with provable guarantees. Our framework provides a rigorous approach to controlling, with high confidence, any distortion risk measure that is characterized by a weighted average of quantiles of the loss incurred by the LLM. The theoretical foundation of our method relies on the connection between conformal risk control and a traditional family of statistics, i.e., L-statistics. To demonstrate the utility of our framework, we conduct comprehensive experiments that address the issue of human-machine misalignment.
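
For concreteness, the "weighted average of quantiles" mentioned in the abstract can be written out as follows. This is a sketch under one common convention for distortion risk measures, assumed here rather than quoted from the paper: $F$ is the loss distribution, $F^{-1}$ its quantile function, and $\psi$ a nondecreasing distortion function on $[0, 1]$ with $\psi(0) = 0$ and $\psi(1) = 1$.

$$
R_\psi(F) = \int_0^1 F^{-1}(p)\, d\psi(p) = \int_0^1 F^{-1}(p)\, \psi'(p)\, dp ,
$$

where the second equality holds when $\psi(y) = \int_0^y \psi'(z)\, dz$. Under this convention, the two standard examples at level $\beta \in (0, 1)$ are

$$
\psi(p) = \mathbf{1}\{p \geq \beta\} \;\Rightarrow\; R_\psi(F) = F^{-1}(\beta) = \operatorname{VaR}_\beta(F),
\qquad
\psi'(p) = \frac{\mathbf{1}\{p \geq \beta\}}{1 - \beta} \;\Rightarrow\; R_\psi(F) = \frac{1}{1 - \beta} \int_\beta^1 F^{-1}(p)\, dp = \operatorname{CVaR}_\beta(F).
$$

$\operatorname{VaR}$ places all of its weight on a single quantile, whereas $\operatorname{CVaR}$ spreads the weight uniformly over the upper tail.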

Paper Structure

This paper contains 35 sections, 5 theorems, 51 equations, 13 figures, 3 tables, 2 algorithms.

Key Result

Theorem 3.1

Assume $r_{\lambda}(x)\in [a, b]$ almost surely for some $-\infty < a < b < \infty$, $F_\lambda$ is continuous and strictly increasing. Further, assume that $\psi(y) = \int_{0}^{y}\psi^{\prime}(z) dz$ for some $\psi'$ that is bounded and continuous at $F_{\lambda}(r)$ for Lebesgue almost-every $r$. where with $\hat{F}_{n,\lambda}$ being the empirical distribution of $r_\lambda(x_1), \ldots, r_\l

Figures (13)

  • Figure 1: Examples of distortion risk measures: Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR).
  • Figure 2: Illustration of $\hat{\lambda}$. We choose $\hat{\lambda}$ as the last $\lambda$ such that the (asymptotic) upper confidence bound $\hat{R}_{\psi}^+(\lambda)$ falls below $\alpha$.
  • Figure 3: Illustration of the process to generate $\mathcal{C}(x), \mathcal{C}_\lambda(x)$, and $r_{\lambda}(x)$. We sample $N$ responses $y_i$, $i = 1, \ldots, N$ from the LLM $p(y \mid x)$. Each response is associated with a machine disutility score ${r}_m(y_i(x))$, and a human-rated disutility score $r(y_i(x))$. To construct $\mathcal{C}_{\lambda_j}(x)$, we keep the responses such that the machine toxicity score satisfies ${r}_m(y_i(x)) \leq \lambda_j$ for each $\lambda_j \in \Lambda$. Finally, we compute the induced score $r_{\lambda}(x)$ by taking the maximum human disutility score of each $\mathcal{C}_{\lambda_j}(x)$.
  • Figure 4: Average sampling cost vs. $\alpha$ (row 1) and realized $\operatorname{CVaR}$ vs. $\alpha$ (row 2), with Spearman correlation $\rho = 0.57$ between human and machine toxicity scores, evaluated on a held-out dataset. The confidence band is the mean estimate plus or minus one standard error, estimated across independent experiments. Each subplot in a row corresponds to a different setting of $\beta \in \{0.5, 0.75, 0.9\}$. From the panels in the first row, we observe that our method, CDRC-L (orange), improves on CDRC-DKW (green) and CDRC-BJ (blue): it achieves risk control (black dotted line) while being less conservative than both baseline methods. As the panels in the second row show, our method is also more efficient in generating a risk-controlled LLM response than DKW or BJ.
  • Figure 5: Average sampling cost vs. $\alpha$ (row 1) and realized $\operatorname{VaR}$ vs. $\alpha$ (row 2), with Spearman correlation $\rho = 0.57$ between human and machine toxicity scores, evaluated on a held-out dataset. The confidence band is the mean estimate plus or minus one standard error, estimated across independent experiments. Each panel in a row corresponds to a different setting of $\beta \in \{0.5, 0.75, 0.9\}$. CDRC-L (orange) improves on CDRC-DKW (green) and CDRC-BJ (blue): it achieves risk control within the margin of error (black dotted line) while being more efficient than both baseline methods.
  • ...and 8 more figures
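
The captions for Figures 2 and 3 describe a pipeline: filter sampled responses by machine score, take the worst human score as the induced loss $r_\lambda(x)$, then pick the last $\lambda$ whose upper confidence bound stays below $\alpha$. A minimal Python sketch of that flow follows. It is an illustration under stated assumptions, not the authors' implementation. The function names, the `empty_value` used when no response passes the filter, and the constant `margin` standing in for the paper's confidence bound are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def induced_score(machine: Sequence[float], human: Sequence[float],
                  lam: float, empty_value: float = 0.0) -> float:
    """Worst human disutility among responses with machine score <= lam."""
    kept = [h for m, h in zip(machine, human) if m <= lam]
    # Assumption: an empty candidate set incurs `empty_value`
    # (this handling is not specified in the text above).
    return max(kept) if kept else empty_value


def empirical_cvar(losses: Sequence[float], beta: float) -> float:
    """Plug-in CVaR_beta as an L-statistic: sum_i w_i * x_(i)."""
    xs = sorted(losses)
    n = len(xs)

    def psi(p: float) -> float:
        # CVaR distortion function under the convention assumed here.
        return max(p - beta, 0.0) / (1.0 - beta)

    # Weight on the i-th order statistic is psi(i/n) - psi((i-1)/n).
    return sum((psi((i + 1) / n) - psi(i / n)) * x for i, x in enumerate(xs))


def select_lambda(prompts: List[Tuple[Sequence[float], Sequence[float]]],
                  grid: Sequence[float], alpha: float, beta: float,
                  margin: float) -> Optional[float]:
    """Largest lambda on the grid whose (placeholder) upper bound is <= alpha."""
    chosen = None
    for lam in sorted(grid):
        losses = [induced_score(m, h, lam) for m, h in prompts]
        # Placeholder bound: point estimate plus a fixed margin. The paper
        # derives its bound from asymptotic normality; that is not reproduced.
        if empirical_cvar(losses, beta) + margin <= alpha:
            chosen = lam
    return chosen


# Toy calibration data: (machine scores, human scores) per prompt.
prompts = [
    ([0.1, 0.4, 0.9], [0.0, 0.3, 0.8]),
    ([0.2, 0.5, 0.7], [0.1, 0.2, 0.9]),
    ([0.3, 0.6, 0.8], [0.2, 0.5, 0.6]),
    ([0.05, 0.35, 0.95], [0.0, 0.1, 1.0]),
]
lam_hat = select_lambda(prompts, grid=[0.2, 0.5, 0.8, 1.0],
                        alpha=0.35, beta=0.5, margin=0.05)
print(lam_hat)
```

On this toy data the empirical $\operatorname{CVaR}_{0.5}$ is 0.05, 0.25, 0.75, and 0.95 across the four grid values, so with $\alpha = 0.35$ and a margin of 0.05 the sketch selects $\hat{\lambda} = 0.5$. Because a larger $\lambda$ keeps more responses, the induced loss can only grow with $\lambda$, which is why scanning the grid for the last admissible value is reasonable.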

Theorems & Definitions (7)

  • Theorem 3.1
  • Corollary 3.2
  • Theorem 3.3
  • Theorem C.1: Asymptotic normality of L-statistics
  • proof
  • Theorem C.2: Consistent variance estimate for L-Statistics
  • proof