Fast Proxies for LLM Robustness Evaluation

Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann

TL;DR

This work addresses the high cost of red-teaming for evaluating LLM robustness by proposing fast proxy metrics. It studies three proxies (direct prompting, prefilling, and embedding-space attacks) and validates them against a synthetic red-teamer ensemble of six attack methods across 33 models and 300 prompts, yielding over 7M jailbreak attempts. The proxies predict the ensemble's attack success rate well, reaching correlations of $r_p=0.87$ (linear) and $r_s=0.94$ (Spearman rank) while reducing compute by about three orders of magnitude. The findings hold for within- and across-family comparisons and for robustness fine-tuning scenarios, suggesting scalable, cost-effective pathways for safety evaluation and checkpoint selection in adversarial training.

Abstract

Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without requiring runs of the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct prompting. Even though direct prompting in particular does not achieve high ASR, we find that it and embedding-space attacks can predict attack success rates well, achieving $r_p=0.87$ (linear) and $r_s=0.94$ (Spearman rank) correlations with the full attack ensemble while reducing computational cost by three orders of magnitude.
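As a rough illustration of what the reported $r_p$ and $r_s$ measure, the sketch below computes a Pearson (linear) and a Spearman (rank) correlation between a proxy's per-model attack success rate and the ensemble's. The ASR values and the helper names (`pearson`, `average_ranks`, `spearman`) are hypothetical and are not taken from the paper; the paper's own evaluation pipeline may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson (linear) correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model ASRs (one entry per model), NOT values from the paper.
proxy_asr = [0.02, 0.05, 0.10, 0.30, 0.45]
ensemble_asr = [0.20, 0.35, 0.50, 0.90, 0.95]

r_p = pearson(proxy_asr, ensemble_asr)
r_s = spearman(proxy_asr, ensemble_asr)
print(f"r_p = {r_p:.3f}, r_s = {r_s:.3f}")
```

In this toy data the proxy ASR is far lower than the ensemble ASR, yet the two rank the models identically, which is why a proxy with low absolute ASR can still be a useful predictor.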

Paper Structure

This paper contains 16 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Attack success rates for different variants of Llama-3-8B. We include instruct versions ( ) as a baseline and compare to safety-tuned ( ), adversarially trained ( $\blacktriangle$, $\blacktriangledown$), circuit breaker ( $\blacklozenge$), and capability-optimized ( , ) models.
  • Figure 2: Attack success rates for models from different families. Direct ASR has the largest $r_s$ and $\tau$, i.e., the order of two models w.r.t. direct ASR is most predictive of order w.r.t. ensemble ASR.
  • Figure 3: Attack success rates for different numbers of robustness fine-tuning steps using Circuit Breakers (Zou et al., 2024). We include the base instruct model and the officially released circuit breaker model. Despite varying success rates, all proxy methods have similar correlation coefficients, i.e., they are similarly predictive of fine-tuning effectiveness. Arrows indicate training progression.
  • Figure 4: Correlation coefficients between proxy attack success rate and ensemble attack success rate under a varying number of prompts. With fewer than 50 prompts, PGD yields higher Spearman and Kendall rank correlations; however, direct prompting scales better with more prompts. Prefilling and PGD achieve higher linear (Pearson) correlations at any prompt count.
  • Figure 5: Attack success rates for different variants of Mistral 7B Instruct. We include instruct versions ( ) as a baseline and compare to safety-tuned ( ), adversarially trained ( $\blacktriangle$), circuit breaker ( $\blacklozenge$), and capability-optimized ( ) models.
  • ...and 2 more figures
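
The pairwise reading of $\tau$ in the Figure 2 caption (whether the order of two models under the proxy matches their order under the ensemble) can be sketched as follows. This is a minimal Kendall tau-a that treats tied pairs as neither concordant nor discordant; the function name `kendall_tau_a` and the ASR values are hypothetical, and the paper may use a tie-corrected variant.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: the pair is ordered the same way in x and y.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-model ASRs, NOT values from the paper.
direct_asr = [0.01, 0.04, 0.03, 0.20]
ensemble_asr = [0.30, 0.55, 0.60, 0.95]

# Only the pair (model 1, model 2) is ordered differently by the two metrics.
tau = kendall_tau_a(direct_asr, ensemble_asr)
print(f"tau = {tau:.3f}")
```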