Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees

Yu Gui, Ying Jin, Zhimei Ren

TL;DR

This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion, and demonstrates that the method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data.

Abstract

Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
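
The abstract describes selecting new units whose predicted alignment scores surpass a data-dependent threshold, but it does not say how that threshold is computed. The sketch below shows one common way to obtain such a threshold with false discovery rate control: conformal p-values built from calibration units, followed by the Benjamini-Hochberg step-up rule. This recipe is an assumption for illustration, not a procedure confirmed by the text above. The function name `conformal_select` and its arguments are hypothetical, and the alignment predictor is assumed to be already trained, so only its scores appear here.

```python
import numpy as np


def conformal_select(cal_scores, cal_aligned, test_scores, alpha):
    """Select test units via conformal p-values and Benjamini-Hochberg.

    Illustrative assumption, not the paper's confirmed procedure.
    cal_scores  : predicted alignment scores on calibration units
    cal_aligned : ground-truth alignment status of calibration units (bool)
    test_scores : predicted alignment scores on new units
    alpha       : target FDR level
    Returns (indices of selected test units, conformal p-values).
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_aligned = np.asarray(cal_aligned, dtype=bool)
    test_scores = np.asarray(test_scores, dtype=float)
    n, m = len(cal_scores), len(test_scores)

    # Scores of calibration units that are NOT aligned: these play the
    # role of "nulls" against which each test score is compared.
    null_scores = np.sort(cal_scores[~cal_aligned])

    # For each test unit, count unaligned calibration units scoring at
    # least as high. A small count means the test score looks unusually good.
    first_ge = np.searchsorted(null_scores, test_scores, side="left")
    counts = len(null_scores) - first_ge
    pvals = (1.0 + counts) / (n + 1.0)

    # Benjamini-Hochberg: find the largest k with p_(k) <= alpha * k / m
    # and select the k units with the smallest p-values.
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int), pvals
    k = int(np.max(np.nonzero(below)[0])) + 1
    return np.sort(order[:k]), pvals


selected, pvals = conformal_select(
    cal_scores=[0.1, 0.2, 0.3, 0.9],
    cal_aligned=[False, False, False, True],
    test_scores=[0.95, 0.05],
    alpha=0.5,
)
```

In this toy call the first test unit outscores every unaligned calibration unit and is selected, while the second is not. The threshold is data-dependent in the sense that the cutoff on the p-values moves with both the calibration scores and the number of test units.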

Paper Structure (55 sections, 3 theorems, 25 equations, 26 figures, 2 tables, 1 algorithm)

Key Result

Theorem 3.1

Suppose that for any $j\in[m]$, $\{Z_{n+j}\} \cup \{Z_i\}_{i\in \mathcal{D}_{\textnormal{cal}}}$ are exchangeable conditional on $\{Z_{n+\ell}\}_{\ell \neq j}$, i.e., for any permutation $\pi$ of $\{1,\dots,n,n+j\}$ and any $\{z_1,\dots,z_n,z_{n+j}\}$, it holds that … Suppose the predicted alignment score $\{\widehat{A}_i\}_{i\in \mathcal{D}_{\textnormal{cal}} \cup \mathcal{D}_\textnormal{test}}$ h…

Figures (26)

  • Figure 1: Pipeline of Conformal Alignment instantiated in the radiology report generation example.
  • Figure 2: Visualization of asymptotic selection rule (red dashed line), with density curves of $g(X)$ for $A\leq c$ (red) and $A>c$ (blue).
  • Figure 3: Realized FDR (blue) and power (red) for conformal alignment applied to the TriviaQA dataset (with $\gamma_1=0.2, \gamma_2=0.5$) at various FDR target levels. The top row corresponds to the results from OPT-13B and the bottom row to those from LLaMa-2-13B-chat; each column corresponds to a value of $|\mathcal{D}|$. Shading represents the area between one standard deviation above and below the mean.
  • Figure 4: Comparison of FDR and power between Conformal Alignment and the heuristic baseline, which selects units by thresholding the self-evaluation scores Self_Eval at a cutoff of one minus the target FDR level. The comparison is conducted on the TriviaQA dataset.
  • Figure 5: Power versus target FDR levels for TriviaQA dataset when the alignment predictor is trained with logistic regression over individual features with $|\mathcal{D}|=2000$, $\gamma_1=0.2$, $\gamma_2=0.5$, averaged over $500$ independent experiments. Note that the FDR is always controlled though not depicted.
  • ...and 21 more figures
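
The captions for Figures 3 and 4 refer to realized FDR, power, and a heuristic baseline that thresholds self-evaluation scores at one minus the target FDR level. A minimal sketch of those quantities for a single experiment follows. The function names are hypothetical, the use of `>=` at the cutoff is an assumption, and the definitions used (false discovery proportion as the fraction of selected units that are not aligned, power as the fraction of aligned units that are selected) are the standard ones rather than anything stated in the captions. Averaging the first quantity over repeated experiments would give the realized FDR.

```python
import numpy as np


def heuristic_select(self_eval, target_fdr):
    """Baseline: keep units whose self-evaluation score reaches 1 - target_fdr.

    Whether the cutoff is inclusive is an assumption made here.
    """
    self_eval = np.asarray(self_eval, dtype=float)
    return np.nonzero(self_eval >= 1.0 - target_fdr)[0]


def fdp_and_power(selected, aligned):
    """False discovery proportion and power for one selection set.

    selected : integer indices of selected units
    aligned  : ground-truth alignment status of all units (bool)
    """
    aligned = np.asarray(aligned, dtype=bool)
    selected = np.asarray(selected, dtype=int)
    n_selected = len(selected)
    n_true = int(aligned[selected].sum())
    # Both ratios are defined as 0 when their denominator is empty.
    fdp = (n_selected - n_true) / max(n_selected, 1)
    power = n_true / max(int(aligned.sum()), 1)
    return fdp, power


baseline = heuristic_select([0.95, 0.85, 0.4, 0.92], target_fdr=0.1)
fdp, power = fdp_and_power(baseline, aligned=[True, True, False, False])
```

In this toy example the baseline selects two units, one of which is not aligned, so the realized proportion of false discoveries is far above the nominal 0.1. Nothing in the fixed cutoff ties it to the actual error rate of the selected set.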

Theorems & Definitions (4)

  • Theorem 3.1
  • Remark 3.2
  • Proposition 3.3
  • Proposition A.1