Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

Han Wang, Gang Wang, Huan Zhang

TL;DR

ASTRA is an efficient and effective defense that adaptively steers models away from adversarial feature directions to resist VLM attacks. It causes little performance drop on benign inputs while strongly suppressing harmful outputs under adversarial inputs.

Abstract

Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks, particularly because their vision capabilities create new vulnerabilities. Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment due to their high costs. To address this challenge, we propose ASTRA, an efficient and effective defense that adaptively steers models away from adversarial feature directions to resist VLM attacks. Our key procedures involve finding transferable steering vectors that represent the direction of harmful responses and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens of the adversarial images and identify those most strongly associated with jailbreaks. These tokens are then used to construct the steering vectors. During inference, we apply an adaptive steering method based on the projection between the steering vectors and the calibrated activation, which causes little performance drop on benign inputs while strongly suppressing harmful outputs under adversarial inputs. Extensive experiments across multiple models and baselines demonstrate our state-of-the-art performance and high efficiency in mitigating jailbreak risks. Additionally, ASTRA exhibits good transferability, defending against unseen attacks (i.e., structured-based attacks, perturbation-based attacks with projected gradient descent variants, and text-only attacks). Our code is available at https://github.com/ASTRAL-Group/ASTRA.
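To make the token-ablation step more concrete, the following is a minimal sketch of one way random ablation could be used to rank visual tokens by their association with a jailbreak. It is not taken from the ASTRA code base: the function names `attribute_tokens` and `top_k_tokens`, the least-squares surrogate, and the `toy_score` stand-in for a real jailbreak score are all assumptions made for illustration.

```python
import numpy as np

def attribute_tokens(score_fn, num_tokens, num_samples=200, keep_prob=0.5, seed=0):
    """Estimate per-token attribution from random ablations (illustrative only).

    score_fn is a hypothetical callable that takes a boolean keep-mask over
    visual tokens and returns a scalar jailbreak score for the ablated image.
    """
    rng = np.random.default_rng(seed)
    # Each row is one random ablation: True means the token is kept.
    masks = rng.random((num_samples, num_tokens)) < keep_prob
    scores = np.array([score_fn(m) for m in masks], dtype=float)
    # Fit a linear surrogate (with a bias column) from masks to scores.
    design = np.hstack([masks.astype(float), np.ones((num_samples, 1))])
    weights, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return weights[:-1]  # drop the bias term

def top_k_tokens(weights, k):
    """Indices of the k tokens with the largest attribution weights."""
    return np.argsort(weights)[::-1][:k]

def toy_score(mask):
    """Toy stand-in for a jailbreak score: tokens 3 and 7 drive the attack."""
    return 1.0 * mask[3] + 0.5 * mask[7]

weights = attribute_tokens(toy_score, num_tokens=16)
print(top_k_tokens(weights, k=2))
```

In a real setting, the score would come from querying the VLM on the ablated adversarial image, and the activations of the selected tokens would then feed into the steering-vector construction described above.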

Paper Structure

This paper contains 29 sections, 6 equations, 13 figures, 9 tables, 1 algorithm.

Figures (13)

  • Figure 1: Illustration of our framework ASTRA. Our key procedures involve finding transferable steering vectors that represent the direction of harmful responses and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens of the adversarial images and identify those most strongly associated with jailbreaks. These tokens are then used to construct the steering vectors. During inference, we apply an adaptive steering method based on the projection between the steering vectors and the calibrated activation, resulting in little influence on benign inputs and a strong impact on adversarial inputs. The solid and dotted lines denote the activations $h^l$ and the calibrated activations $h^l-h_0^l$, respectively. Blue denotes the calibration activation $h_0^l$, and red denotes the case of adversarial inputs.
  • Figure 2: Illustration of steering. Red and green denote the activations for adversarial and benign inputs, respectively. Blue and brown denote the calibration activations $h^l_0$ and the steering vectors $v^l$, respectively.
  • Figure 3: Transferability in ID scenarios. Avg. denotes the average of steering vectors derived from the adversarial images with $\epsilon \in \{\frac{16}{255}, \frac{32}{255}, \frac{64}{255}, \text{unconstrained}\}$. Additional results for LLaVA-v1.5 can be found in the Appendix, Fig. \ref{fig:heatmap1}.
  • Figure 4: Data used for steering vectors construction and test evaluation in the Jailbreak setup.
  • Figure 5: Prompt template.
  • ...and 8 more figures
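
The captions of Figures 1 and 2 describe steering in terms of the activation $h^l$, the calibration activation $h_0^l$, and the steering vector $v^l$. The sketch below shows one plausible reading of projection-based adaptive steering using those symbols. The exact update rule is not given in this summary, so the clamped projection, the `alpha` strength parameter, and the function name `adaptive_steer` are assumptions rather than the authors' implementation.

```python
import numpy as np

def adaptive_steer(h, h0, v, alpha=1.0):
    """Steer activation h away from direction v (illustrative sketch).

    h     : activation at layer l
    h0    : calibration activation at layer l
    v     : steering vector at layer l (harmful direction)
    alpha : hypothetical steering strength
    """
    v_unit = v / np.linalg.norm(v)
    calibrated = h - h0  # calibrated activation h - h0
    # Scalar projection of the calibrated activation onto the harmful direction.
    proj = float(np.dot(calibrated, v_unit))
    # Assumption: steer only when the activation points along the harmful
    # direction, so inputs with a non-positive projection stay untouched.
    coeff = max(proj, 0.0)
    return h - alpha * coeff * v_unit

h0 = np.zeros(2)
v = np.array([2.0, 0.0])
adversarial = np.array([2.0, 3.0])  # positive component along v
benign = np.array([-1.0, 3.0])      # no positive component along v
print(adaptive_steer(adversarial, h0, v))
print(adaptive_steer(benign, h0, v))
```

Under this reading, the steering strength scales with how far the calibrated activation extends along the harmful direction, which matches the stated goal of little influence on benign inputs and a strong impact on adversarial ones.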