Robust LLM safeguarding via refusal feature adversarial training

Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda

TL;DR

This work identifies a universal jailbreaking mechanism in LLMs: adversarial prompts disrupt safety by ablating the refusal feature in residual activations. It then introduces Refusal Feature Adversarial Training (ReFAT), an efficient training scheme that simulates worst-case perturbations by dynamically ablating refusal feature directions, improving robustness across multiple models and attack types while preserving utility. The approach substantially reduces attack success rates at a much lower computational cost than traditional adversarial training methods. The study links the interpretability of linear features in activation space to practical defense, offering a scalable path toward safer, more reliable LLM deployment. One limitation is that multilingual and vernacular prompts may bypass the computed refusal directions, which points to broader linguistic coverage as future work.

Abstract

Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation for offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experimental results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead than existing adversarial training methods.
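
To make the idea of refusal feature ablation more concrete, here is a minimal NumPy sketch on synthetic data. It assumes the refusal feature is the difference between the mean residual activations of harmful and harmless instructions (as the Figure 3 caption suggests) and that ablation means projecting that direction out of each activation. The function names, array shapes, and the projection form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along mean(harmful) - mean(harmless).

    Both inputs have shape (n_prompts, d_model). This difference-of-means
    definition is an assumption based on the Figure 3 caption.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Remove each activation's component along a unit direction.

    Hypothetical stand-in for RFA: h' = h - (h . r_hat) r_hat.
    """
    coeffs = acts @ direction                  # shape (n_prompts,)
    return acts - np.outer(coeffs, direction)  # shape (n_prompts, d_model)


# Synthetic activations: harmful prompts are shifted along one axis.
rng = np.random.default_rng(0)
d_model = 16
offset = np.zeros(d_model)
offset[0] = 3.0
harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(64, d_model)) + offset

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)

# After ablation, no activation has any component along r_hat.
print(np.abs(ablated @ r_hat).max())
```

In a real model the activations would come from a chosen residual-stream layer rather than a random generator, and a ReFAT-style training loop would apply such a perturbation during the forward pass while training the model to refuse harmful requests.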

Paper Structure

This paper contains 32 sections, 10 equations, 19 figures, 6 tables, 1 algorithm.

Figures (19)

  • Figure 1: Upper panel: adversarial attacks share a common mechanism, namely ablating the refusal feature (RF) of harmful requests in the LLM's hidden representation space, so that malicious prompts look more benign and can therefore jailbreak the model. The color sliders in the middle depict the RF: the right (red) extreme indicates high input harmfulness, and the left (green) extreme indicates high input safety. Lower panel: the ReFAT scheme, in which we train LLMs to refuse harmful requests while ablating the RF during the forward pass by pushing it towards the safe extreme, thus forcing the model to judge input harmfulness in a more robust way.
  • Figure 2: Layerwise cosine similarity between the mean shift induced by each of four adversarial attacks and the negative vector of the refusal feature. Shaded areas denote 99% confidence intervals.
  • Figure 3: 2-D PCA visualization of: (1) harmful (dark red stars) vs. harmless (dark green stars) instructions; and (2) the original HarmBench instructions (light red dots) and their counterparts adversarially modified by attack algorithms (light green dots). All hidden representations are taken from the 16th-layer residual stream of Llama-3-8B-Instruct. The dark green arrows show the mean activation difference between harmful and harmless instructions (i.e., the negative vector of the refusal feature), and the light green arrows show the mean adversarial representational shifts induced by the attacks. The positions and norms of both shift vectors have been adjusted for better readability.
  • Figure 4: Changes in the attack success rate (ASR) of four LLM attacks after refusal feature restoration (i.e., resetting the refusal feature to the mean activation value of the original harmful inputs, as in Equation \ref{eq:causal-rf-filtering}). Restoring the refusal feature dramatically reduces the effectiveness of the attacks.
  • Figure 5: Layerwise optimality of RFA as adversarial perturbation.
  • ...and 14 more figures
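
The measurements described in the Figure 2 and Figure 4 captions can be sketched as follows on synthetic data. This is only an illustration: the function names are hypothetical, the "attack" is simulated by subtracting most of a known offset, and restoration is assumed to reset only the component along the refusal direction to its mean value on the original harmful inputs. The paper's own restoration equation is not reproduced in this excerpt and may differ.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def attack_shift_alignment(clean_acts, attacked_acts, refusal_feature):
    """Cosine similarity between the mean attack-induced shift and the
    negative refusal feature (the quantity described for Figure 2)."""
    mean_shift = (attacked_acts - clean_acts).mean(axis=0)
    return cosine(mean_shift, -refusal_feature)


def restore_refusal_feature(attacked_acts, clean_harmful_acts, refusal_feature):
    """Reset the component along the refusal direction to its mean value
    on the original harmful inputs (assumed reading of Figure 4)."""
    r_hat = refusal_feature / np.linalg.norm(refusal_feature)
    target = clean_harmful_acts.mean(axis=0) @ r_hat  # scalar
    current = attacked_acts @ r_hat                   # shape (n_prompts,)
    return attacked_acts + np.outer(target - current, r_hat)


# Synthetic single-layer activations; all numbers here are made up.
rng = np.random.default_rng(0)
n, d_model = 256, 16
offset = np.zeros(d_model)
offset[0] = 5.0
harmless = rng.normal(size=(n, d_model))
harmful = rng.normal(size=(n, d_model)) + offset
# Simulated attack: remove most of the harmful offset, plus small noise.
attacked = harmful - 0.9 * offset + 0.1 * rng.normal(size=(n, d_model))

refusal_feature = harmful.mean(axis=0) - harmless.mean(axis=0)
alignment = attack_shift_alignment(harmful, attacked, refusal_feature)
restored = restore_refusal_feature(attacked, harmful, refusal_feature)
print(round(alignment, 3))
```

Repeating the alignment computation once per layer, with activations taken from that layer, would yield a layerwise curve of the kind the Figure 2 caption describes.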