Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

Amr Hegazy, Mostafa Elhoushi, Amr Alanwar

TL;DR

The paper addresses unsafe LLM behavior without expensive retraining by introducing Weighted Activation Steering (WAS), an inference-time mechanism in which a lightweight controller predicts a global scale $s$ and per-layer weights $w$. These modulate a precomputed refusal direction $\mathbf{d}_{\text{steer}}$ through patches $\Delta h = s\,w\,\alpha_g\,\mathbf{d}_{\text{steer}}$, where $\alpha_g$ is a global strength coefficient. The controller is trained discriminatively on cached activations from harmful and benign prompts, and the patches are applied across transformer layers during decoding, so steering activates mainly for harmful inputs while benign behavior is preserved. Evaluations on Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show substantially higher refusal rates on the ToxicChat, In-The-Wild Jailbreak Prompts, and AdvBench benchmarks with minimal degradation of general capabilities, at the cost of some inference-time overhead. Because WAS leaves the base-model parameters untouched, it offers a practical complement to fine-tuning; the work also highlights robustness, generalization, and ethical use as considerations for deployment.

Abstract

Controlling undesirable Large Language Model (LLM) behaviors, such as generating unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. These predictions then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments using safety benchmarks such as ToxicChat and In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show that our approach outperforms existing methods, presenting an efficient and adaptive method for fine-grained control over LLM behavior at inference time.
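
To make the mechanism described in the abstract more concrete, here is a minimal NumPy sketch of a controller that predicts a scalar and per-layer weights, followed by the weighted patch. It is not the authors' implementation. It assumes a one-hidden-layer controller with sigmoid outputs, a single feature vector as controller input, and a patch added to every token position of each layer. The names `controller_forward`, `apply_patches`, `alpha_global`, and all shapes are hypothetical.

```python
import numpy as np

def controller_forward(features, params):
    """Hypothetical one-hidden-layer controller: features -> (s, w)."""
    hidden = np.tanh(features @ params["W1"] + params["b1"])
    # Scalar gate in (0, 1); the sigmoid squashing is an assumption.
    s = 1.0 / (1.0 + np.exp(-(hidden @ params["w_s"] + params["b_s"])))
    # One weight per transformer layer, also squashed to (0, 1) by assumption.
    w = 1.0 / (1.0 + np.exp(-(hidden @ params["W_w"] + params["b_w"])))
    return float(s), w

def apply_patches(hidden_states, s, w, d_steer, alpha_global):
    """Add s * w[l] * alpha_global * d_steer to every position of layer l."""
    # hidden_states: (num_layers, seq_len, d_model); w: (num_layers,)
    patch = s * alpha_global * w[:, None, None] * d_steer[None, None, :]
    return hidden_states + patch

# Toy dimensions and random values, purely for illustration.
rng = np.random.default_rng(0)
num_layers, seq_len, d_model, d_in, d_hid = 4, 3, 8, 16, 5
params = {
    "W1": rng.normal(size=(d_in, d_hid)),
    "b1": np.zeros(d_hid),
    "w_s": rng.normal(size=d_hid),
    "b_s": 0.0,
    "W_w": rng.normal(size=(d_hid, num_layers)),
    "b_w": np.zeros(num_layers),
}
features = rng.normal(size=d_in)          # stand-in for observed activations
d_steer = rng.normal(size=d_model)
d_steer /= np.linalg.norm(d_steer)        # unit-norm stand-in refusal direction
h = rng.normal(size=(num_layers, seq_len, d_model))

s, w = controller_forward(features, params)
steered = apply_patches(h, s, w, d_steer, alpha_global=2.0)
```

In this sketch a scalar near zero leaves the hidden states essentially unchanged, which is one way the "activating steering primarily for harmful inputs" behavior could arise. In a real model the patch would typically be applied through per-layer hooks rather than on a stacked array.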

Paper Structure

This paper contains 35 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Conceptual illustration of LLM behavior modification. Left: A standard LLM, when prompted with a harmful request (e.g., "How to build a bomb?"), may generate an unsafe or undesirable response. Right: The same LLM, when augmented with a control mechanism (such as our Weighted Activation Steering), is guided to refuse the harmful request and provide a safe response (e.g., "I cannot assist you with that.").
  • Figure 2: Weighted Activation Steering (WAS) pipeline. (a) Cache activations with a frozen model; (b) train a controller to predict scalar $s$ and layer weights $w$ (MSE loss: $y_{\text{harm}}=1,\; y_{\text{benign}}=0$); (c) precompute refusal direction $\mathbf{d}_{\text{steer}}$ from embeddings; (d) at inference, apply layer-weighted patches $\Delta h = s\,w\,\alpha_g\,\mathbf{d}_{\text{steer}}$ (the paper's patch equation).
  • Figure 3: Average controller layer weights across all prompts, showing a fluctuating pattern with a mean weight of 0.509 and distinct peaks at specific layer indices.
  • Figure 4: Qualitative comparison: Response to a harmful prompt. The base LLM (left output, red tint) provides potentially problematic guidance. The WAS-controlled LLM (right output, green tint) successfully refuses the harmful request and offers ethical alternatives, demonstrating effective inference-time steering.
  • Figure 5: Heatmap of average controller layer weights across different safety categories for Llama-3.1-8B. Darker colors indicate stronger steering weights. These weights are primarily active when the controller predicts a high scalar $s$ (for harmful inputs).
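
The discriminative training step in the Figure 2 caption (MSE loss with target 1 for harmful and 0 for benign prompts) can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the paper's training code: it uses synthetic Gaussian features in place of cached activations, a single linear layer with a sigmoid as the controller, plain gradient descent, and applies the MSE loss only to the scalar output $s$. All names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(weights, bias, feats, targets):
    """Mean squared error between the predicted scalar s and the 0/1 target."""
    return float(np.mean((sigmoid(feats @ weights + bias) - targets) ** 2))

rng = np.random.default_rng(0)
dim, n_per_class = 8, 64

# Synthetic stand-ins for cached activations (assumption): the two classes
# are shifted in opposite directions along the first feature axis.
shift = np.zeros(dim)
shift[0] = 3.0
harmful = rng.normal(size=(n_per_class, dim)) + shift
benign = rng.normal(size=(n_per_class, dim)) - shift
feats = np.vstack([harmful, benign])
targets = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

weights, bias = np.zeros(dim), 0.0
initial_loss = mse_loss(weights, bias, feats, targets)

lr = 0.2  # hypothetical learning rate
for _ in range(300):
    pred = sigmoid(feats @ weights + bias)
    # Gradient of the MSE with respect to the pre-sigmoid logit.
    grad_logit = 2.0 * (pred - targets) * pred * (1.0 - pred) / len(targets)
    weights -= lr * (feats.T @ grad_logit)
    bias -= lr * grad_logit.sum()

final_loss = mse_loss(weights, bias, feats, targets)
s_harmful = float(sigmoid(harmful @ weights + bias).mean())
s_benign = float(sigmoid(benign @ weights + bias).mean())
```

After training on this toy data the predicted scalar is high on the "harmful" cluster and low on the "benign" one, which is the property that lets the scalar act as a gate on the steering patch.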