Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs

Hen Davidov, Shai Feldman, Gilad Freidkin, Yaniv Romano

TL;DR

This work defines time-to-unsafe-sampling $T$ as a per-prompt safety metric for LLMs and casts its estimation under a finite sampling budget as a survival-analysis problem. It constructs a Probably Approximately Correct (PAC) lower predictive bound (LPB) $\hat{L}(X)$ with finite-sample guarantees using a calibrated conformal approach, and introduces an optimized censoring-budget allocation to improve sample efficiency while preserving coverage. The method is validated on synthetic data and on a RealToxicityPrompts experiment with large-scale generation, showing that the Optimized calibration yields valid, informative LPBs that scale with budget and provide practical risk assessment for prompt-level safety. The approach enables proactive safety decisions, such as adaptive auditing and allocation of computational resources, by quantifying how many safe responses to expect before unsafe content may occur. Overall, the paper provides tools for proactive safety evaluation of generative AI under realistic resource constraints.

Abstract

We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
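To make the censoring issue concrete, the following is a minimal sketch of how a finite sampling budget turns time-to-unsafe-sampling into a right-censored observation. It is an illustration only, not the paper's calibration procedure. It assumes that generations for a prompt are independent and each is unsafe with a fixed probability; the names `p_unsafe`, `budget`, `observe_with_budget`, and `censored_fraction` are hypothetical.

```python
import random


def observe_with_budget(p_unsafe, budget, rng):
    """Sample up to `budget` generations for one prompt.

    Assumption (not from the source): each generation is independently
    unsafe with probability `p_unsafe`.

    Returns (observed_time, event):
      - event is True if an unsafe generation appeared within the budget,
        and observed_time is the index of that generation (1-based);
      - event is False if the budget ran out first, in which case
        observed_time equals the budget (a right-censored observation).
    """
    for t in range(1, budget + 1):
        if rng.random() < p_unsafe:
            return t, True
    return budget, False


def censored_fraction(p_unsafe_per_prompt, budget, rng):
    """Fraction of prompts whose unsafe time was not observed within the budget."""
    outcomes = [observe_with_budget(p, budget, rng) for p in p_unsafe_per_prompt]
    n_censored = sum(1 for _, event in outcomes if not event)
    return n_censored / len(outcomes)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-prompt unsafe probabilities; most prompts are rarely unsafe.
    probs = [0.001] * 90 + [0.2] * 10
    print("censored fraction at budget 50:", censored_fraction(probs, 50, rng))
```

When most prompts are rarely unsafe, a large share of observations are censored at the budget, which is why the problem is treated with survival-analysis tools rather than by averaging observed times directly.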

Paper Structure

This paper contains 43 sections, 7 theorems, 75 equations, 7 figures, 1 table, 4 algorithms.

Key Result

Theorem 4.1

Fix a tolerance level $\delta \in \left(0,1\right)$ and a miscoverage level $\tau\in \left(0,1\right)$. Suppose that $\{(X_i, T_i)\}_{i=1}^{n}$ and $(X_\textup{test}, T_\textup{test})$ are drawn i.i.d. and that the censoring times satisfy the conditional independence assumption (Assumption assumption

Figures (7)

  • Figure 1: Synthetic experiments. Left: Coverage (target 90%). Center: Mean number of samplings generated per prompt. Right: Mean LPB. Shaded regions represent the standard deviation over 20 runs.
  • Figure 2: RealToxicityPrompts dataset experiment. Left: Empirical coverage rate (target 90%). The true coverage of the Optimized method is in a solid red line, while the upper and lower bounds on the coverage of the Uncalibrated and Naive methods correspond to the dotted lines. Right: Mean LPB. Shaded regions represent the standard deviation, computed over $5$ runs. Higher is better.
  • Figure 3: Results of synthetic experiments as a function of average budget per prompt $B/|{\mathcal{I}_2}|$. Left: Coverage (target 90%; gray dashed line). Center: Mean number of samplings generated per prompt. Right: Mean LPB. Shaded regions represent the standard deviation over 20 runs.
  • Figure 4: Results of synthetic experiments as a function of average budget per prompt $B/|{\mathcal{I}_2}|$. Top: Coverage (target level indicated by a gray dashed line). Bottom: Mean LPB. Shaded regions represent the standard deviation over 20 runs.
  • Figure 5: Results of synthetic experiments as a function of the maximum weight $w_\text{max}$ of the Trimmed and Optimized methods. Left: Coverage (target 90%; gray dashed line). Right: Mean LPB. Shaded regions represent the standard deviation over 20 runs.
  • ...and 2 more figures

Theorems & Definitions (13)

  • Theorem 4.1: General validity, informal
  • Proposition 5.1: Maximal weight bound
  • Proposition B.1
  • Proposition B.2
  • Proposition C.1: Unbiasedness and conditional variance of the weighted miscoverage estimator
  • Proposition C.2: Variance linearly increasing in mean weight under constant miscoverage
  • proof
  • Proof of Proposition \ref{prop:variance_constant_miscoverage}
  • Proof of Proposition \ref{prop:budget-tight}
  • Proof of Proposition \ref{prop:unique_lambda}
  • ...and 3 more