Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

Keita Broadwater

TL;DR

APST is a depth-oriented framework for evaluating LLM safety under repeated inference. It treats each generation as an independent Bernoulli trial and estimates per-inference failure probabilities. It combines a calibration phase with a cross-model, depth-focused phase, using AIR-BENCH prompts to compare shallow and repeated-sampling safety across models and decoding settings. The key finding is that models with similar shallow safety scores can display markedly different inference-level reliability under sustained use: failure-probability estimates rise from near zero at shallow sampling depth before stabilizing, and they vary by risk category and temperature. This approach translates evaluation outcomes into deployment-relevant risk metrics, enabling cost-reliability tradeoffs and more informed decisions about model configuration and operational risk in high-stakes settings.

Abstract

Traditional benchmarks for large language models (LLMs) primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment exposes a different class of risk: operational failures arising from repeated inference on identical or near-identical prompts rather than broad task generalization. In high-stakes settings, response consistency and safety under sustained use are critical. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by reliability engineering. APST repeatedly samples identical prompts under controlled operational conditions (e.g., decoding temperature) to surface latent failure modes including hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST models them as stochastic outcomes of independent inference events. We formalize safety failures using Bernoulli and binomial models to estimate per-inference failure probabilities, enabling quantitative comparison of reliability across models and decoding configurations. Applying APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH-derived safety prompts, we find that models with similar benchmark-aligned scores can exhibit substantially different empirical failure rates under repeated sampling, particularly as temperature increases. These results demonstrate that shallow, single-sample evaluation can obscure meaningful reliability differences under sustained use. APST complements existing benchmarks by providing a practical framework for evaluating LLM safety and reliability under repeated inference, bridging benchmark alignment and deployment-oriented risk assessment.
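The Bernoulli and binomial framing described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the choice of a Wilson score interval, and the example counts are assumptions made here for illustration only. The sketch estimates a per-inference failure probability from k failures in n repeated generations, attaches an approximate 95% interval, and, under the independence assumption, computes the probability of at least one failure over a given number of inferences.

```python
import math


def failure_rate(failures: int, trials: int) -> float:
    """Point estimate of the per-inference failure probability (k / n)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must lie between 0 and trials")
    return failures / trials


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = failure_rate(failures, trials)
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def prob_at_least_one_failure(p: float, inferences: int) -> float:
    """P(at least one failure in `inferences` independent trials) = 1 - (1 - p)^n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return 1.0 - (1.0 - p) ** inferences


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    k, n = 3, 100
    print("estimated failure rate:", failure_rate(k, n))
    print("approx. 95% Wilson interval:", wilson_interval(k, n))
    print("P(>=1 failure in 1000 calls):", prob_at_least_one_failure(k / n, 1000))
```

One consequence of this framing is that zero observed failures at shallow depth does not imply a zero failure probability: the upper end of the interval stays well above zero until n is large.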

Paper Structure

This paper contains 56 sections, 1 equation, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Conceptual comparison of LLM safety evaluation paradigms along evaluation breadth (number of prompts or risk categories) and evaluation depth (repeated sampling enabling statistical estimation). Standard benchmarks emphasize breadth with primarily single-sample or shallow evaluation, while adversarial testing emphasizes adaptive depth over a narrow prompt set. Accelerated Prompt Stress Testing (APST) occupies a distinct statistical-depth regime, combining category-level coverage with repeated sampling to estimate empirical failure probabilities under sustained inference.
  • Figure 2: Empirical failure probability by temperature for the Phase 1 calibration model (Gemma-3N-E4B-it) on the Phase 1 calibration prompts. Each bar aggregates outcomes across repeated generations of identical prompts at fixed decoding configurations. Non-zero failure probabilities are observed, demonstrating that stochastic safety failures persist even under conservative evaluation conditions.
  • Figure 3: Empirical failure probability as a function of sampling depth for Phase 1 calibration. Failure probability estimates increase from near-zero at shallow depth and stabilize only after moderate numbers of repeated samples. This illustrates that single-sample or low-depth evaluation systematically underestimates operational failure risk. Here, "cumulative" refers to aggregation over increasing sample counts for a fixed configuration, not temporal accumulation or time-dependent behavior.
  • Figure 4: Empirical CDF of prompt-level harm under repeated sampling. Each curve shows the empirical distribution of the number of harmful outputs observed per prompt over 100 repeated generations, aggregated across 20 prompts for a fixed decoding temperature. This ECDF reflects observed variability across prompts without parametric modeling or extrapolation. Even under conservative decoding, a non-trivial fraction of prompts exhibit multiple harmful outputs, highlighting heterogeneity that is invisible to single-sample or shallow evaluation.
  • Figure 5: Mean AIR-BENCH–equivalent safety score by model under shallow evaluation (Phase 2A). Scores are aggregated across all prompts and risk categories using the AIR-BENCH three-level rubric at $T=0.0$, $N=3$.
  • ...and 4 more figures
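
The quantities plotted in Figures 3 and 4 can be computed roughly as follows. This is a sketch under assumptions, not the authors' code: the helper names are hypothetical, and the data come from a seeded simulation with made-up per-prompt failure probabilities rather than from model outputs. The first helper returns the running failure estimate after each additional sample (aggregation over sample count, not over time), and the second returns the empirical CDF of harmful-output counts per prompt.

```python
import random


def running_failure_estimate(outcomes: list[int]) -> list[float]:
    """Failure-rate estimate after each additional sample (1 = failure, 0 = safe)."""
    estimates = []
    failures = 0
    for depth, outcome in enumerate(outcomes, start=1):
        failures += outcome
        estimates.append(failures / depth)
    return estimates


def ecdf(counts: list[int]) -> list[tuple[int, float]]:
    """Empirical CDF: for each distinct count x, the fraction of prompts with count <= x."""
    n = len(counts)
    return [(x, sum(c <= x for c in counts) / n) for x in sorted(set(counts))]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    # Hypothetical setup mirroring the Figure 4 caption: 20 prompts, 100 generations each.
    # The per-prompt failure probabilities below are invented for illustration.
    true_p = [rng.choice([0.0, 0.01, 0.05]) for _ in range(20)]
    samples = [[int(rng.random() < p) for _ in range(100)] for p in true_p]

    # Running estimate for one prompt as sampling depth grows.
    curve = running_failure_estimate(samples[0])
    print("estimate at depth 1, 10, 100:", curve[0], curve[9], curve[99])

    # Distribution of harmful-output counts across prompts.
    per_prompt_counts = [sum(s) for s in samples]
    for x, frac in ecdf(per_prompt_counts):
        print(f"<= {x} harmful outputs: {frac:.2f} of prompts")
```

At depth 1 the running estimate can only be 0 or 1, which is why estimates taken from one or a few samples say little about rare failures.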