RepIt: Steering Language Models with Concept-Specific Refusal Vectors

Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang

TL;DR

RepIt tackles the problem of entangled safety signals in language models by proposing a data-efficient, three-step disentanglement framework that isolates concept-specific refusal directions. It computes per-layer, per-position difference-in-means vectors, then cleans them via reweighting, ridge-regularized whitening, and orthogonalization before selecting a final direction with COSMIC and applying an affine editor (ACE) for intervention. Across five frontier models and multiple safety datasets, RepIt achieves strong target-specific jailbreaks (ASR ≈ $0.4$–$0.7$ on WMD prompts) while preserving non-target refusals (ASR ≈ $0.1$), with the corrective changes localized to roughly 100–200 neurons and derivable from as few as 12–24 prompts. The work reveals a potential security risk: standard safety benchmarks can underestimate narrow, concept-specific jailbreaks, underscoring the need for representation-aware auditing and robust defenses as LLMs become increasingly capable and deployed widely.
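The pipeline above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes activations have already been collected at one layer and token position, uses synthetic data, and replaces the reweighting and ridge-regularized whitening steps with a single ridge-regularized least-squares projection onto the non-target directions. The function names, the `ridge` value, and the `rho` scaling argument are hypothetical. COSMIC selection and the ACE intervention are not shown.

```python
import numpy as np


def difference_in_means(harmful_acts, harmless_acts):
    """DIM direction: mean harmful activation minus mean harmless activation."""
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)


def remove_nontarget_component(v_target, nontarget_dirs, ridge=1e-2, rho=1.0):
    """Subtract rho times the ridge-regularized projection of v_target
    onto the span of the non-target directions (rows of nontarget_dirs).

    Assumption: this stands in for the paper's whitening + orthogonalization;
    the exact formulation in the paper may differ.
    """
    R = np.atleast_2d(np.asarray(nontarget_dirs, dtype=float))  # (k, d)
    v = np.asarray(v_target, dtype=float)                       # (d,)
    gram = R @ R.T + ridge * np.eye(R.shape[0])                 # (k, k), ridge-stabilized
    coeffs = np.linalg.solve(gram, R @ v)                       # least-squares coefficients
    projection = R.T @ coeffs                                   # part of v explained by R
    return v - rho * projection, projection


# Synthetic activations (illustrative only): target prompts carry a shared
# refusal component plus a concept-specific one; non-target prompts carry
# only the shared component.
rng = np.random.default_rng(0)
d = 16
shared = rng.normal(size=d)
specific = rng.normal(size=d)
harmless = rng.normal(size=(24, d))
target_prompts = rng.normal(size=(24, d)) + shared + specific
nontarget_prompts = rng.normal(size=(24, d)) + shared

v_t = difference_in_means(target_prompts, harmless)
r = difference_in_means(nontarget_prompts, harmless)
v_clean, removed = remove_nontarget_component(v_t, r, ridge=1e-8)
```

With a near-zero `ridge`, `v_clean` is numerically orthogonal to `r`. A larger `ridge` removes less of the shared component, which may be preferable when the non-target directions are estimated from few prompts.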

Abstract

While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100–200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data, extending to underrepresented, data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior.

Paper Structure

This paper contains 34 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: RepIt can jailbreak the target weapon-of-mass-destruction (WMD) category while preserving refusal on other safety benchmarks. We evaluate on TDC2023, JailbreakBench, AdvBench (Zou et al., 2023), and Malicious Instruct. RepIt is designed to narrowly increase attack success on the target category (red) while maintaining refusal on the remaining datasets, thereby minimizing collateral increases in their attack success rates (ASR). The unaltered DIM vector (shown as translucent bars in the figure) generalizes strongly to external datasets; by disentangling the DIM vector with RepIt we produce a targeted jailbreak that largely evades the four other evaluations. Concretely, we achieve target-category jailbreak rates as high as 0.7 while keeping non-target ASR increases to about 0.1.
  • Figure 2: Target (WMD prompts) vs. non-target (JailbreakV and StrongREJECT) jailbreak success rates across datasets and models. Baseline refers to the unaltered model's ASR on the respective prompt set. $v_{t}$ refers to the difference-in-means (DIM) vector computed on the WMD prompts themselves, whereas $v_{RepIt}$ is the vector isolated from $v_{t}$ via RepIt. While $v_{t}$ achieves general jailbreaking capability, $v_{RepIt}$ jailbreaks WMD prompts specifically and preserves refusal on unrelated topics, keeping the intervention's ASR on non-target data low. RepIt thus disentangles the target vector from non-target concepts while retaining jailbreaking capability on target data.
  • Figure 3: Comparison of jailbreak success rates for target vs. non-target directions across models and categories. $v_t$ refers to the unaltered DIM vector of target concept prompts. $R_{p,\ell}$ refers to the DIM vector generated from the non-target basis formed by JailbreakV + StrongREJECT, which RepIt uses during the orthogonalization process. $\alpha P$ refers to the projection removed during orthogonalization. We demonstrate that both the $R_{p,\ell}$ DIM vector and the projection $\alpha P$ can steer target-concept refusal as well as, or even better than, the original target vector $v_t$. This highlights that representational entanglement between target and non-target concepts can paradoxically strengthen jailbreaking effectiveness. LlamaNemo4B's Chem and Cyber results are marked with a * because the selected $\rho$ is 0, which zeroes out the projection.
  • Figure 4: Target vs. non-target jailbreak success rates under constrained target sizes. We evaluate RepIt in data-constrained settings where the target vector is constructed from either 12 or 24 randomly selected training examples. Success rates are evaluated across five seeds, and we report the mean and range of the resulting values. We also include the "full" results, which use the whole training dataset. The results demonstrate RepIt's data efficiency in isolating target-category refusal directions while keeping non-target jailbreak success low, with performance generally comparable to, or even exceeding, that on the full dataset.
  • Figure 5: $\rho$ search on the validation set to find a $\rho$ value that minimizes entanglement beyond the chosen threshold of 0.1 ASR.
  • ...and 2 more figures
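
The captions for Figures 3 and 5 refer to a scaling value $\rho$ that is searched on a validation set against a 0.1 ASR threshold. The sketch below shows one plausible reading of that search, not the paper's procedure: among candidate $\rho$ values, keep those whose non-target ASR stays at or below the threshold and pick the one with the highest target ASR. The `evaluate` callback, the candidate grid, and the toy numbers are hypothetical placeholders for steering a real model and judging its outputs.

```python
from typing import Callable, Iterable, Optional, Tuple


def attack_success_rate(judgments: Iterable[bool]) -> float:
    """ASR: fraction of prompts whose response was judged a successful attack."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(bool(j) for j in judgments) / len(judgments)


def search_rho(
    evaluate: Callable[[float], Tuple[float, float]],
    candidates: Iterable[float],
    threshold: float = 0.1,
) -> Optional[Tuple[float, float, float]]:
    """Return (rho, target_asr, nontarget_asr) with the highest target ASR
    among candidates whose non-target ASR is <= threshold, or None if no
    candidate qualifies. `evaluate(rho)` is assumed to return validation-set
    (target_asr, nontarget_asr) for the vector steered with that rho.
    """
    best = None
    for rho in candidates:
        target_asr, nontarget_asr = evaluate(rho)
        if nontarget_asr > threshold:
            continue  # too much collateral jailbreaking on non-target prompts
        if best is None or target_asr > best[1]:
            best = (rho, target_asr, nontarget_asr)
    return best


# Toy validation results (made-up numbers, for illustration only).
toy_results = {
    0.0: (0.8, 0.6),
    0.5: (0.7, 0.3),
    1.0: (0.6, 0.1),
    1.5: (0.3, 0.05),
}
chosen = search_rho(toy_results.__getitem__, toy_results.keys())
example_asr = attack_success_rate([True, False, True, True])
```

On the toy table, `chosen` is `(1.0, 0.6, 0.1)`: the candidates 1.0 and 1.5 both satisfy the threshold, and 1.0 retains more target ASR. If $\rho = 0$ is selected, nothing is removed, which matches the starred entries described in the Figure 3 caption.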