Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?

Hongzheng Yang, Yongqiang Chen, Zeyu Qin, Tongliang Liu, Chaowei Xiao, Kun Zhang, Bo Han

TL;DR

The paper questions whether representation intervention can faithfully identify and erase harmful concepts in non-linear LLMs, showing a fundamental faithfulness gap that defeats perfect non-linear erasure. To address this, it proposes COCA, a method that reframes training data with explicit concept reasoning to concentrate unsafe concepts into a linear subspace, enabling effective, minimally distorting erasure. The authors provide theoretical justification and extensive experiments across multiple base models, demonstrating improved robustness to both in-distribution (ID) and out-of-distribution (OOD) jailbreak prompts while preserving math and coding capabilities. Overall, COCA offers a principled, practical safety-alignment strategy that complements and extends existing representation-editing approaches by tackling non-linearity head-on.

Abstract

Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models (LLMs) to elicit the aligned and expected behaviors. Despite the empirical success, it has never been examined whether one could locate the faithful concepts for intervention. In this work, we explore the question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose Concept Concentration (COCA). Instead of identifying the faithful locations to intervene, COCA refactors the training data with an explicit reasoning process, which first identifies the potential unsafe concepts and then decides the responses. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates while maintaining strong performance on regular tasks such as math and code generation.

Paper Structure

This paper contains 32 sections, 5 theorems, 28 equations, 6 figures, 6 tables.

Key Result

Theorem 3.1

Let $v_X \in \mathbb{R}^d$ and $v_Z \in \mathbb{R}^k$ be random vectors with finite first moment. Consider an affine transformation $r: \mathbb{R}^d \to \mathbb{R}^d$ defined by $r(v_X) = P v_X + b$, where $P \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^d$. Then, $r(v_X)$ is independent of $v_Z$ (i.e., $r(v_X)$ linearly guards $v_Z$) if and only if
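
The excerpt stops before the condition itself. In LEACE (Belrose et al., 2023), the source cited for this theorem, the corresponding condition is that the cross-covariance between the transformed representation and the concept vanishes, which for an affine map reads $P \Sigma_{v_X v_Z} = 0$ with $\Sigma_{v_X v_Z} = \mathrm{Cov}(v_X, v_Z)$. The sketch below checks that condition numerically on synthetic data. It is an illustration under stated assumptions, not the paper's code: the helper names `cross_cov` and `linear_eraser` are hypothetical, the data are random, and $P$ is taken to be a plain orthogonal projector rather than the distortion-minimizing eraser derived in LEACE.

```python
import numpy as np


def cross_cov(A, B):
    """Empirical cross-covariance matrix between the columns of A and B."""
    Ac = A - A.mean(axis=0)
    Bc = B - B.mean(axis=0)
    return Ac.T @ Bc / (len(A) - 1)


def linear_eraser(X, Z, tol=1e-10):
    """Return (P, b, sigma_xz) such that r(x) = P x + b satisfies P @ sigma_xz = 0.

    Assumption: P is the orthogonal projector onto the complement of the
    column space of sigma_xz. This is a simple valid choice, not the
    LEACE-optimal eraser.
    """
    sigma_xz = cross_cov(X, Z)                      # shape (d, k)
    U, s, _ = np.linalg.svd(sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]                     # basis of colsp(sigma_xz)
    P = np.eye(X.shape[1]) - U @ U.T
    mu = X.mean(axis=0)
    b = mu - P @ mu                                 # keeps the mean of X unchanged
    return P, b, sigma_xz


# Synthetic data (hypothetical): the concept Z leaks linearly into X.
rng = np.random.default_rng(0)
n, d, k = 2000, 8, 2
Z = rng.normal(size=(n, k))
X = rng.normal(size=(n, d))
X[:, :k] += 2.0 * Z

P, b, sigma_xz = linear_eraser(X, Z)
R = X @ P.T + b                                     # r(v_X) applied row-wise

print("max |Cov(X, Z)|    :", np.abs(cross_cov(X, Z)).max())
print("max |Cov(r(X), Z)| :", np.abs(cross_cov(R, Z)).max())
print("max |P Sigma_XZ|   :", np.abs(P @ sigma_xz).max())
```

Because the empirical cross-covariance of $r(v_X)$ with $v_Z$ equals $P$ times that of $v_X$, projecting out the column space of $\Sigma_{v_X v_Z}$ drives it to zero up to floating-point error. The offset `b` only restores the mean and plays no role in the condition.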

Figures (6)

  • Figure 1: An illustration of COCA: As representation intervention fails to faithfully localize and control the harmful behaviors of LLMs, we resort to reasoning-based approaches and present COCA. COCA refactors the training responses into structured formats that prompt LLMs to explicitly reason about the underlying harmful concepts and then respond accordingly. LLMs trained with the refactored data demonstrate significant robustness against both in-distribution and OOD jailbreaking attacks.
  • Figure 2: PCA visualization of instruction internal representations at layer 16 for LLaMA-3.1-8B.
  • Figure 3: Impact of concept reasoning components on jailbreak attack success rate (lower is better) for LLaMA-3.1-8B. Comparison between Enhanced Data, Enhanced Data with Fixed Concept, and Enhanced Data with Fixed Thinking across different jailbreak attack types.
  • Figure 4: Comparison of over-refusal and attack success rate for models trained on Vanilla and Enhanced data.
  • Figure 5: PCA visualization of instruction representations at an early layer (layer 1).
  • ...and 1 more figure

Theorems & Definitions (7)

  • Theorem 3.1: Linear Concept Erasure Condition (Belrose et al., 2023)
  • Theorem 3.2: Impossibility of Perfect Non-linear Concept Erasure
  • Corollary 3.3: Concept concentration
  • Theorem H.1: Impossibility of Perfect Non-linear Concept Erasure
  • proof
  • Corollary I.1: Concept concentration
  • proof