Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy, Mostafa Elhoushi, Amr Alanwar
TL;DR
The paper addresses unsafe LLM behavior without expensive retraining by introducing Weighted Activation Steering (WAS), an inference-time mechanism in which a lightweight controller predicts a global scale $s$ and per-layer weights $w_l$ that modulate a precomputed refusal direction $\boldsymbol{d}_{steer}$ through per-layer patches $\Delta h_l = s \cdot w_l \cdot \alpha_{global} \cdot \boldsymbol{d}_{steer}$. The controller is trained discriminatively on cached activations from harmful and benign prompts, and the patches are applied across transformer layers during decoding, so that steering targets harmful inputs while benign performance is preserved. Evaluations on Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show substantial increases in refusal rates on the ToxicChat, Jailbreak, and AdvBench benchmarks with minimal degradation of general capabilities, at the cost of some inference-time overhead. WAS shows that an efficient, adaptive, layer-aware steering strategy can improve safety at inference time without altering base-model parameters, making it a practical complement to fine-tuning. The work points toward more flexible, deployment-friendly alignment tools and flags robustness, generalization, and ethical use as open considerations.
Abstract
Controlling undesirable Large Language Model (LLM) behaviors, such as generating unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. These predictions dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, that is applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to apply nuanced, layer-aware interventions discriminatively, activating steering primarily for harmful inputs. Experiments on safety benchmarks such as ToxicChat and In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show that our approach outperforms existing methods, presenting an efficient and adaptive method for fine-grained control over LLM behavior at inference time.
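
To make the mechanism concrete, the following is a minimal sketch of the patch rule stated in the TL;DR, $\Delta h = s \cdot w \cdot \boldsymbol{d}_{steer} \cdot \alpha_{global}$, applied once per layer. It is an illustration under stated assumptions, not the authors' implementation: the single-linear-layer controller, the sigmoid squashing of its outputs, the function names (`controller`, `apply_patches`), and all numeric values are hypothetical, and plain Python lists stand in for real activation tensors.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def controller(activation, scale_w, scale_b, layer_w, layer_b):
    """Hypothetical linear controller (an assumption, not the paper's design).

    Maps one observed activation vector to a global scale s and one
    weight per layer, each squashed into (0, 1) by a sigmoid.
    """
    s = sigmoid(dot(scale_w, activation) + scale_b)
    w = [sigmoid(dot(row, activation) + b) for row, b in zip(layer_w, layer_b)]
    return s, w


def apply_patches(hidden_states, d_steer, s, w, alpha_global):
    """Return h_l + s * w_l * alpha_global * d_steer for every layer l."""
    patched = []
    for h, w_l in zip(hidden_states, w):
        coeff = s * w_l * alpha_global
        patched.append([h_i + coeff * d_i for h_i, d_i in zip(h, d_steer)])
    return patched


# Toy setup: hidden size 3, two layers, made-up controller parameters.
d_steer = [1.0, 0.0, 0.0]             # stand-in for the refusal direction
scale_w, scale_b = [4.0, 0.0, 0.0], -2.0
layer_w = [[1.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
layer_b = [0.0, 0.0]

harmful_act = [2.0, 0.0, 0.0]         # pretend activation for a harmful prompt
benign_act = [-2.0, 0.0, 0.0]         # pretend activation for a benign prompt

s_harm, w_harm = controller(harmful_act, scale_w, scale_b, layer_w, layer_b)
s_benign, w_benign = controller(benign_act, scale_w, scale_b, layer_w, layer_b)

hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
steered_harm = apply_patches(hidden, d_steer, s_harm, w_harm, alpha_global=2.0)
steered_benign = apply_patches(hidden, d_steer, s_benign, w_benign, alpha_global=2.0)
```

In this toy setting the controller's scale is close to 1 for the "harmful" activation and close to 0 for the "benign" one, so the benign hidden states are left almost untouched. In a real model the patch would be added to each layer's hidden state during generation, and the base-model weights would remain unchanged.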
