Conformal Policy Control

Drew Prinster, Clara Fannjiang, Ji Won Park, Kyunghyun Cho, Anqi Liu, Suchi Saria, Samuel Stanton

TL;DR

Conformal calibration on data from the safe policy determines how aggressively the new policy can act while provably enforcing the user's declared risk tolerance, with finite-sample guarantees even for non-monotonic bounded constraint functions.

Abstract

An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.

Paper Structure

This paper contains 62 sections, 6 theorems, 121 equations, 9 figures, 5 tables, and 2 algorithms.

Key Result

Theorem 4.2

Assume exchangeable $L_i(\lambda)$. Define $\lambda_{\max} := \sup\Lambda \in \Lambda$ and assume … If the $L_i(\lambda)$ are $K$-Lipschitz in $\lambda$ and $\hat{\lambda}_+$ is $\epsilon$-replace-one stable, then …

Figures (9)

  • Figure 1: An illustration of safe exploration with conformal policy control. Starting with a safe policy $\pi_0$ and safe context prompts $\mathcal{X}_0:=\{x_1, \dots, x_m\}$, we observe the safe policy's actions $a_i$ and the corresponding rewards $r_i \in \mathbb{R}$ and constraint violations $\ell_i \in \{0, 1\}$. We split the observations into train and calibration data, and optimize the safe policy to obtain $\pi_t$. We then query the user for their constraint violation risk tolerance $\alpha$, and apply conformal risk control to find a bound on the likelihood ratio $\pi_t / \pi_0$ which guarantees the calibrated risk is controlled at level $\alpha$. Finally, we apply rejection sampling to probabilistically regulate the optimized policy for deployment and observation, allowing the user to flexibly trade off reward, constraint risk, and test-time compute.
  • Figure 2: Empirical comparison of standard CRC (Angelopoulos et al., 2022; orange) and the proposed generalized CRC (gCRC; blue) on synthetic non-monotonic losses. gCRC achieves risk control across various target risk levels while CRC does not (left). On an example empirical risk trajectory for these experiments, CRC underestimates the risk while gCRC maintains risk control by searching from safe to aggressive hyperparameter values (right).
  • Figure 3: Rejection sampling for policy interpolation at three $\beta$ values. As $\beta$ increases from near-zero to large values, the constrained policy $\pi^{\beta}$ interpolates from the safe policy $\pi_0$ to the optimized policy $\pi_t$. Blue histogram bars show accepted samples matching $\pi^{\beta}$, while teal/red bars show rejected samples from the proposal distribution. (a) Small $\beta$ nearly recovers the safe policy. (b) Intermediate $\beta$ gives a partial interpolation with frequent rejections. (c) Large $\beta$ approaches the optimized policy with minimal rejection.
  • Figure 4: Medical QA factuality control results on MedLFQA. Left: Empirical FDR (fraction of retained claims that are false) vs. target risk level $\alpha$. The dashed line indicates $y = x$; valid methods should fall at or below this line. Right: Recall (fraction of true claims retained) vs. $\alpha$. gCRC tightly controls FDR at the target level while achieving superior recall to baselines. Results averaged over 25 random splits. Error regions are standard errors.
  • Figure 5: Conformal policy control in the active learning setting. We train Gaussian process regression models on tabular datasets, providing a small amount of initial data for training and selecting the remaining training data via exponential tilting toward the posterior predictive variance. We introduce a feasibility constraint based on alignment with the leading principal component of the Gram matrix to make the task more difficult. CPC controls the constraint violation risk at our desired threshold $\alpha = 0.2$, while simultaneously reducing test MSE. Surprisingly, in some cases the risk-controlled data selection policy attains lower test MSE than the uncontrolled policy.
  • ...and 4 more figures
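
The captions for Figures 1 and 3 describe regulating the optimized policy by bounding the likelihood ratio $\pi_t / \pi_0$ and sampling with rejection. The sketch below illustrates that idea on a small discrete action space. It is not taken from the paper: the form $\pi^{\beta} \propto \min(\pi_t, \beta \pi_0)$, the rule for choosing the proposal distribution, and the names `constrained_policy` and `sample_constrained` are all assumptions chosen to match the interpolation behavior the caption describes. Both policies are assumed to be strictly positive probability vectors.

```python
import numpy as np


def constrained_policy(pi_0, pi_t, beta):
    """Assumed form of the constrained policy: normalize min(pi_t, beta * pi_0).

    The unnormalized ratio to pi_0 is capped at beta, so a small beta stays
    close to pi_0 and a large beta recovers pi_t.
    """
    unnorm = np.minimum(pi_t, beta * pi_0)
    return unnorm / unnorm.sum()


def sample_constrained(pi_0, pi_t, beta, n, rng):
    """Draw n actions from constrained_policy by rejection sampling.

    Returns (samples, acceptance_rate). The proposal choice is an assumption:
    propose from pi_t when beta >= 1, otherwise from pi_0.
    """
    if beta >= 1.0:
        proposal = pi_t
        # Accepted mass: pi_t * min(1, beta * pi_0 / pi_t) = min(pi_t, beta * pi_0)
        accept_prob = np.minimum(1.0, beta * pi_0 / pi_t)
    else:
        proposal = pi_0
        # Accepted mass: pi_0 * min(1, pi_t / (beta * pi_0)) = min(pi_t, beta * pi_0) / beta
        accept_prob = np.minimum(1.0, pi_t / (beta * pi_0))

    accepted, n_proposed = [], 0
    while len(accepted) < n:
        batch = rng.choice(len(proposal), size=n, p=proposal)
        keep = rng.random(n) < accept_prob[batch]
        accepted.extend(batch[keep].tolist())
        n_proposed += n

    acceptance_rate = len(accepted) / n_proposed
    return np.array(accepted[:n]), acceptance_rate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pi_0 = np.array([0.4, 0.3, 0.2, 0.1])    # hypothetical safe policy
    pi_t = np.array([0.05, 0.05, 0.2, 0.7])  # hypothetical optimized policy
    for beta in (0.01, 2.0, 7.0):
        samples, rate = sample_constrained(pi_0, pi_t, beta, 20_000, rng)
        freq = np.bincount(samples, minlength=len(pi_0)) / len(samples)
        print(beta, constrained_policy(pi_0, pi_t, beta), freq, rate)
```

Under these assumptions the acceptance rate is the cost of regulation: a lower rate means more proposals per deployed action, which is one way to read the trade-off between constraint risk and test-time compute mentioned in the Figure 1 caption. In the paper, $\beta$ is chosen by conformal calibration on safe-policy data; here it is simply passed in by hand.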

Theorems & Definitions (13)

  • Definition 4.1: $\epsilon$-replace-one stability
  • Theorem 4.2
  • Remark 4.3: Attaining $\alpha$-validity if $K$ is known
  • Remark 4.4: Monotone envelope interpretation
  • Theorem 4.5
  • Theorem 2.1: Restatement of Theorem \ref{thm:gcrc_nonmonotonic}
  • proof
  • Theorem 2.2: Restatement of Theorem \ref{thm:cpc}
  • proof
  • Proposition 2.3
  • ...and 3 more