Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty

Abstract

Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.

Paper Structure (42 sections, 1 theorem, 5 equations, 8 figures, 8 tables, 2 algorithms)

This paper contains 42 sections, 1 theorem, 5 equations, 8 figures, 8 tables, 2 algorithms.

Key Result

Theorem 4.1

Let $J(A)$ be the Max-Avg objective function defined in Eq. eq:max_avg_div. If $d(\cdot, \cdot)$ is a metric satisfying the triangle inequality, and $A^\star$ and $\hat{A}$ denote an optimal solution and a solution returned by Algorithm algo:2, respectively, then $J(\hat{A}) \geq \frac{1}{2}J(A^\star)$.
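
To make the shape of this guarantee concrete, the following is a minimal sketch of a generic greedy selection for a Max-Avg style objective. It is not the paper's Algorithm algo:2, whose steps are not reproduced on this page. Two assumptions are made here: $J(A)$ is taken to be the average pairwise distance within the selected subset, and the greedy rule seeds with the farthest pair and then repeatedly adds the candidate with the largest total distance to the current selection. The function names `max_avg_objective`, `greedy_diverse_subset`, and `brute_force_optimum` are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def max_avg_objective(items: Sequence[T], subset: Sequence[int], dist: Distance) -> float:
    """Average pairwise distance among the selected indices (assumed form of J)."""
    pairs = list(combinations(subset, 2))
    if not pairs:
        return 0.0
    return sum(dist(items[i], items[j]) for i, j in pairs) / len(pairs)


def greedy_diverse_subset(items: Sequence[T], k: int, dist: Distance) -> List[int]:
    """Greedy heuristic (an assumption, not the paper's algorithm).

    Seeds with the farthest pair, then adds the candidate whose summed
    distance to the already selected items is largest, until k are chosen.
    """
    n = len(items)
    if k < 2 or k > n:
        raise ValueError("k must satisfy 2 <= k <= len(items)")
    first, second = max(
        combinations(range(n), 2),
        key=lambda ij: dist(items[ij[0]], items[ij[1]]),
    )
    selected = [first, second]
    remaining = set(range(n)) - set(selected)
    while len(selected) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(
            sorted(remaining),
            key=lambda c: sum(dist(items[c], items[s]) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


def brute_force_optimum(items: Sequence[T], k: int, dist: Distance) -> Tuple[int, ...]:
    """Exhaustive search for the best subset; only feasible for tiny inputs."""
    return max(
        combinations(range(len(items)), k),
        key=lambda subset: max_avg_objective(items, subset, dist),
    )
```

On small inputs, comparing `max_avg_objective` for the greedy subset against the brute-force optimum gives an empirical check of the ratio $J(\hat{A}) / J(A^\star)$. Such a check only illustrates the bound on the instances tried; the guarantee itself rests on the theorem's proof and on $d$ being a metric.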

Figures (8)

  • Figure 1: Illustration of input-space versus output-space search for jailbreaking. In input-space search, several variations or perturbations of the original safety-critical query are generated to elicit unsafe responses from an LLM. In contrast, output-space search is an orthogonal and complementary approach in which multiple responses are generated from a safety-critical prompt to assess whether any of them are unsafe.
  • Figure 2: Attack Success Rate (ASR) trends on Qwen2.5-7B-Instruct. The plots show ASR across the HarmBench, JailbreakBench, AdvBench, and MaliciousInstruct datasets as a function of: (a) total number of generations ($N$), (b) sampling temperature ($\tau$), and (c) nucleus sampling probability ($p$), while holding other parameters constant. The results demonstrate that broader exploration of the output space, whether through increased sample size or higher stochasticity, leads to a monotonic increase in ASR.
  • Figure 3: 2D plots of embeddings of $1024$ responses generated from Llama-2-13B-Chat for three safety-critical prompts. The color indicates the likelihood of a response being unsafe.
  • Figure 4: Illustration of the difference between (a) IID sampling and (b) PDPS. IID sampling generates a large number of long responses through diverse token-level sampling. In contrast, PDPS produces a small set of diverse responses that approximate the modality coverage of large-scale IID sampling, while retaining the computational efficiency of small-scale IID sampling.
  • Figure 5: Bar plot of the mean toxicity difference, defined as the difference between the average number of unsafe responses returned by PDPS and that returned by a baseline method, across various model–dataset combinations. The average is computed over only those queries for which the respective method achieves a successful attack. Blue bars indicate a positive difference, while red bars indicate a negative difference.
  • ...and 3 more figures
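
The rise of ASR with the number of generations $N$ in Figure 2(a) is consistent with a simple independence argument, sketched below as a toy model rather than the paper's analysis. Assume each response to a fixed prompt is unsafe independently with probability `p_unsafe`; the chance that at least one of $N$ IID samples is unsafe is then $1 - (1 - p)^N$. The value of `p_unsafe` is hypothetical, and real per-prompt probabilities vary across prompts and models.

```python
import math


def prob_at_least_one_unsafe(p_unsafe: float, n_samples: int) -> float:
    """P(at least one unsafe response among n IID samples) = 1 - (1 - p)^n."""
    if not 0.0 <= p_unsafe <= 1.0:
        raise ValueError("p_unsafe must lie in [0, 1]")
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return 1.0 - (1.0 - p_unsafe) ** n_samples


def samples_needed(p_unsafe: float, target: float) -> int:
    """Smallest n with 1 - (1 - p)^n >= target, for 0 < p < 1 and 0 < target < 1."""
    if not (0.0 < p_unsafe < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_unsafe and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_unsafe))


if __name__ == "__main__":
    # Hypothetical per-sample unsafe probability, chosen only for illustration.
    p = 0.01
    for n in (1, 16, 128, 1024):
        print(n, round(prob_at_least_one_unsafe(p, n), 4))
```

Under this model, even a small per-sample probability compounds quickly as $N$ grows, which is why large-scale IID sampling is effective but costly, and why the paper looks for a method that reaches similar coverage with fewer full-length generations.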

Theorems & Definitions (1)

  • Theorem 4.1