Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong

TL;DR

This work investigates whether a lightweight Narrow Safety Proxy can predict LLM jailbreak outcomes in a black-box setting. It introduces the Outline Filling Attack to generate dense samples of security boundaries and a Ranking Regression framework to handle ASR domain shift, culminating in a global scoring method for attack optimization. The results show that ASR and ALR are largely predictable across multiple LLMs, that ranking regression achieves strong ordinal accuracy and generalizes to unseen prompts, and that proxy-guided scoring substantially reduces the cost of first successful jailbreaks. These findings highlight the distillability of LLM safety mechanisms and have implications for both advancing defensive research and informing attack-aware safety design.

Abstract

In the realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate (ASR) of adversarial prompts, remains underexplored. This work investigates the distillability of an LLM's core security logic. We propose a novel framework that incorporates an improved outline filling attack to achieve dense sampling of the model's security boundaries. Furthermore, we introduce a ranking regression paradigm that replaces standard regression and trains the proxy model to predict which prompt yields a higher ASR. Experimental results show that our proxy model achieves an accuracy of 91.1 percent in predicting the relative ranking of average long response (ALR), and 69.2 percent in predicting ASR. These findings confirm the predictability and distillability of jailbreak behaviors, and demonstrate the potential of leveraging such distillability to optimize black-box attacks.
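
To make the ranking regression idea concrete, the sketch below shows one way pairwise ordinal labels could be built from measured rates and how pairwise accuracy could be scored. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names `make_pairs` and `pairwise_accuracy`, the tie-skipping rule, the 0.5 decision threshold, and the example rates are all hypothetical.

```python
from itertools import combinations


def make_pairs(rates):
    """Build (i, j, label) pairs from per-item measured rates.

    label is 1 if item i has the higher rate, 0 if item j does.
    Ties are skipped (an assumption) because they carry no ordinal signal.
    """
    pairs = []
    for i, j in combinations(range(len(rates)), 2):
        if rates[i] == rates[j]:
            continue
        pairs.append((i, j, 1 if rates[i] > rates[j] else 0))
    return pairs


def pairwise_accuracy(pairs, predicted_prob):
    """Fraction of pairs whose predicted order matches the label.

    predicted_prob(i, j) is a hypothetical callable returning the
    proxy's probability that item i outranks item j.
    """
    if not pairs:
        return 0.0
    correct = sum(
        1 for i, j, label in pairs
        if (predicted_prob(i, j) > 0.5) == bool(label)
    )
    return correct / len(pairs)


# Illustrative rates only; these values are not from the paper.
rates = [0.7, 0.0, 0.2, 0.2]
pairs = make_pairs(rates)


def oracle(i, j):
    """Stand-in predictor that always agrees with the measured rates."""
    return 1.0 if rates[i] > rates[j] else 0.0


accuracy = pairwise_accuracy(pairs, oracle)
```

Framing the target as "which of two items ranks higher" means the proxy only has to learn relative order, so a shift in the absolute rate scale between settings does not by itself invalidate the labels.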

Paper Structure

This paper contains 35 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Conceptual illustration of the proposed Narrow Safety Proxy framework. (a) A shield with holes represents the flawed security mechanism of LLMs, where only a few attacks can penetrate. (b) Represents the black-box privilege setting. (c) "Engraving the shield onto a target" symbolizes constructing a safety proxy model. Interacting with this proxy in a white-box manner enhances the attacker's capabilities. (d) The attacker returns to the black-box setting with improved jailbreak capabilities and a higher success rate.
  • Figure 2: The overall framework of the proposed method, illustrating the pipeline from Outline Filling Attack sampling to Ranking Regression training. (a) Generate attack instruction groups for each dangerous question using Outline Filling Attack. (b) Repeatedly input each attack instruction into the target LLM. (c) Classify the target LLM's responses into three categories: short refusal, long safe response, and long dangerous response; long dangerous responses correspond to ASR, while long responses correspond to ALR. (d) Pair instructions within each dangerous question's set and label them with binary ordinal relationships based on actual $ASR_{p1}$ and $ALR_{p2}$ to create a fine-tuning dataset for training the Ranking Regression proxy. (e) Use the proxy to determine the pairwise ordinal relationship of new attack instructions, generating a probability between 0 and 1. (f) Derive the most probable ordinal ranking of ASR based on the generated matrix and select the instruction with the highest predicted danger.
  • Figure 3: Distribution of Average ASR, Average ALR, Non-zero ASR count, and Non-zero ALR count across all questions.
  • Figure 4: ASR distribution visualization.
  • Figure 5: Prediction success rate relative to ASR distribution, indicating absence of reward hacking.
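
Steps (e) and (f) of Figure 2 describe turning a matrix of pairwise probabilities into a single ordering. The caption does not say how that ordering is derived, so the sketch below uses a simple row-mean (Borda-style) aggregation as a stand-in. The function name `rank_from_pairwise` and the example matrix are hypothetical.

```python
def rank_from_pairwise(P):
    """Rank items from a pairwise probability matrix.

    P[i][j] is the predicted probability that item i has a higher
    rate than item j. The diagonal is ignored. Each item's score is
    the mean of its off-diagonal row entries (an assumed aggregation,
    not necessarily the one used in the paper).
    """
    n = len(P)
    scores = []
    for i in range(n):
        others = [P[i][j] for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    # Highest mean win probability first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order, scores


# Illustrative matrix only; these values are not from the paper.
P = [
    [0.5, 0.8, 0.6],
    [0.2, 0.5, 0.3],
    [0.4, 0.7, 0.5],
]
order, scores = rank_from_pairwise(P)
top_item = order[0]
```

Row-mean scoring is cheap and tolerates mildly inconsistent pairwise predictions; a likelihood-based model such as Bradley-Terry would be a natural alternative when the matrix is noisy.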