Uncovering Safety Risks of Large Language Models through Concept Activation Vector

Zhihao Xu; Ruixuan Huang; Changyu Chen; Xiting Wang

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

Zhihao Xu, Ruixuan Huang, Changyu Chen, Xiting Wang

TL;DR

The paper presents Safety Concept Activation Vector (SCAV), a framework to interpret LLM safety mechanisms and guide both embedding- and prompt-level attacks. It provides a closed-form solution for embedding perturbations and an automatic layer-selection strategy, along with a prompt-level attack optimized via a hierarchical genetic algorithm. Empirical results show SCAV enhances attack success and output quality across multiple open-source LLMs and transfers to GPT-4, revealing systemic safety risks and transferability of vulnerabilities. The findings challenge assumptions about aligned LLM safety, demonstrate weaknesses of unlearning defenses, and underscore the need for more robust safety defenses and responsible AI practices. Code and data-driven insights offer a foundation for defense-oriented research and safer deployment of large language models.

Abstract

Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks with automatically selected perturbation hyperparameters. Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data. Additionally, we find that our generated attack prompts may be transferable to GPT-4, and the embedding-level attacks may also be transferred to other white-box LLMs whose parameters are known. Our experiments further uncover the safety risks present in current LLMs. For example, in our evaluation of seven open-source LLMs, we observe an average attack success rate of 99.14%, based on the classic keyword-matching criterion. Finally, we provide insights into the safety mechanism of LLMs. The code is available at https://github.com/SproutNan/AI-Safety_SCAV.

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

TL;DR

Abstract

Paper Structure (50 sections, 14 equations, 15 figures, 18 tables, 1 algorithm)

This paper contains 50 sections, 14 equations, 15 figures, 18 tables, 1 algorithm.

Introduction
Methodology
Problem Formulation
SCAV Framework
Embedding-Level Attack
Optimizing Attacks for a Single Layer
Attacking Multiple Layers
Prompt-Level Attack
Comparative Study
Experimental Setup
Embedding-Level Attack Results
Prompt-Level Attack Results
Understanding Safety Risks and Mechanisms of LLMs
Are Aligned LLMs Really Safe?
Are Existing Unlearn Methods Really Effective?
...and 35 more sections

Figures (15)

Figure 1: Test accuracy of $P_\text{m}$ on different layers of LLMs.
Figure 2: Comparison of perturbations added by our method (SCAV) and the baselines RepE zou2023representation and JRE li2024open. Our method consistently moves embeddings of malicious instructions to the subspace of safe instructions, while the baselines may result in ineffective or even opposite perturbations.
Figure 3: ASR-keyword vs. training data size on Advbench, LLaMA-2-7B-Chat. Shaded backgrounds denote variations.
Figure 4: Unveiling the safety mechanisms of LLMs by (a) attacking a single layer; (b) attacking multiple layers, and (c) transferring embedding-level attacks to other white-box LLMs.
Figure 5: A Pipeline Demonstration for Conducting Embedding-Level and Prompt-Level Attacks Using SCAVs.
...and 10 more figures

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

TL;DR

Abstract

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

Authors

TL;DR

Abstract

Table of Contents

Figures (15)