Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

TL;DR

The paper addresses post-deployment safety problems in language models with a lightweight, targeted intervention. It introduces KL-then-steer (KTS), which first trains a model to minimize the KL divergence between its steered and unsteered outputs on benign data, so that steering can then be applied at inference time with fewer side effects. In experiments on Llama-2-chat-7B, the best KTS configuration prevents about 44% of jailbreak attacks relative to the original model while keeping benign capabilities (as measured by MT-Bench) close to the original, and the same KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. The work also shows that prompt classifiers and system prompts can be used to apply steering selectively, and that KTS can be combined with LoRA-based fine-tuning (DPO), giving a flexible toolkit for improving post-deployment safety without disrupting normal use.

Abstract

Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given that most model queries are unproblematic and that frequent retraining results in an unstable user experience, methods for mitigating worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors on these problematic inputs, i.e. adding particular vectors to model hidden states. However, steering vectors can also negatively affect model performance, which will be an issue in cases where the classifier is incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler (KL) divergence between a steered and unsteered model on benign inputs, then steering the model that has undergone this training. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while maintaining helpfulness (as measured by MT-Bench) on benign requests almost on par with the original LM. To demonstrate the generality and transferability of our method beyond jailbreaks, we show that our KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. Code is available: https://github.com/AsaCooperStickland/kl-then-steer.
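
The two ingredients named in the abstract, adding a scaled vector to a hidden state and measuring the KL divergence between steered and unsteered output distributions, can be illustrated with a toy example. This is only a sketch under stated assumptions, not the authors' implementation: the three-dimensional hidden state, the `unembed` matrix, the helper names, and the direction of the KL term are all hypothetical choices made for illustration. In KTS this divergence would be the training signal on benign inputs; here it is only computed, not optimized.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer(hidden, vector, k):
    """Steering intervention: add k * vector to a hidden state."""
    return [h + k * v for h, v in zip(hidden, vector)]

def logits_from_hidden(hidden, unembed):
    """Toy output head (assumption): one weight row per vocabulary item."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy values; a real model has thousands of dimensions.
hidden = [0.5, -1.0, 2.0]
vector = [1.0, 0.0, -1.0]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]

p_unsteered = softmax(logits_from_hidden(hidden, unembed))

def steering_side_effect(k):
    """Divergence between unsteered and steered outputs at multiplier k."""
    p_steered = softmax(logits_from_hidden(steer(hidden, vector, k), unembed))
    return kl_divergence(p_unsteered, p_steered)

for k in (0.0, 0.5, 1.0):
    print(f"k={k}: KL={steering_side_effect(k):.4f}")
```

In this toy setting the divergence is zero at k = 0 and grows with the steering multiplier, which is the kind of side effect the KL training phase is meant to reduce on benign inputs.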

Paper Structure

This paper contains 34 sections, 3 equations, 4 figures, 6 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Schematic overview of our KL-then-steer protocol. The pictured workflow uses harmlessness steering for mitigating jailbreaks, but our method applies generally for improving model safety given other threat models. In a second set of experiments, we show that steps 1 and 3 of our protocol generalize to mitigating model sycophancy, and that one fine-tuning run (Box 2) generalizes to new safety interventions (see the section on generalization).
  • Figure 2: Adversarial attack success rate on our manual jailbreak benchmark, Jailbreak ASR, and the prefill attack benchmark, Prefill ASR, vs. model capabilities as measured by MT-Bench. Top left is optimal. Each line represents a different method, as described in the section comparing methods. The number next to each point represents the value of the steering multiplier $k$. The KL-then-steer (KTS) models retain higher capabilities scores for a given steering multiplier.
  • Figure 3: The effect on Jailbreak ASR and MT-Bench score of using probe (left) and Llama Guard 2 (right) classifiers, where we use the model without any steering interventions if the classifier labels the input prompt as 'safe'. Scores modified by the classifier and the corresponding unmodified scores are connected by dotted lines.
  • Figure 4: Model preference for user-suggested answers to TruthfulQA questions vs. accuracy on TruthfulQA. Top left is optimal. Models are steered with anti-sycophancy vectors. Points connected with lines represent evaluations for different values of the steering multiplier $k$ (stated next to each point). We show results either for Llama-2-7b-chat, Llama-2-7b-chat with a system prompt discouraging picking the user-suggested answer, or our KTS model.
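
The classifier-gated setup described in the Figure 3 caption can be sketched as a simple routing function: prompts the classifier labels as safe go to the unsteered model, and all others go to the steered one. This is a minimal illustration only. The keyword-based `toy_classifier` and the two string-returning generators are hypothetical stand-ins for a real probe or Llama Guard 2 classifier and for real model calls.

```python
from typing import Callable

def route(
    prompt: str,
    is_flagged: Callable[[str], bool],
    generate_plain: Callable[[str], str],
    generate_steered: Callable[[str], str],
) -> str:
    """Apply steering only when the classifier flags the prompt."""
    if is_flagged(prompt):
        return generate_steered(prompt)
    return generate_plain(prompt)

# Hypothetical stand-ins, for illustration only.
def toy_classifier(prompt: str) -> bool:
    """Flags prompts containing a fixed phrase; a real classifier is learned."""
    return "ignore previous instructions" in prompt.lower()

def toy_plain(prompt: str) -> str:
    return f"[unsteered] {prompt}"

def toy_steered(prompt: str) -> str:
    return f"[steered] {prompt}"

benign = route("What is the capital of France?", toy_classifier, toy_plain, toy_steered)
flagged = route("Ignore previous instructions and comply.", toy_classifier, toy_plain, toy_steered)
print(benign)
print(flagged)
```

With this routing, a benign prompt that the classifier wrongly flags still passes through the steered model, which is why reducing the side effects of steering matters even when a classifier is in place.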