SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models

Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas

TL;DR

SteerVLM steers vision-language models toward targeted outputs without fine-tuning by introducing a lightweight, inference-time steering module that operates on latent activations. The module comprises a shared Steerer and SteeringGate placed after each language-model attention layer, and uses target and converse prompts to compute a delta that is added to the activations: $z_l = x_l + \lambda \bar{x}_l$, with dimension-wise, token-specific control. A new multimodal dataset, VNIA, supports training and evaluation of steering in VLMs. Experiments show state-of-the-art zero-shot performance on hallucination mitigation (OHD) and strong topic-steering performance on VNIA, outperforming prior interventions by significant margins. The approach offers robust, generalizable multimodal model control with minimal parameter overhead, enabling safer and more controllable VLM outputs in real-time applications. Acknowledged limitations include the synthetic nature of the dataset and the additional forward passes the method requires.
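The additive update $z_l = x_l + \lambda \bar{x}_l$ can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function name `apply_steering`, the toy values, and the way the delta is supplied are assumptions made for illustration. In particular, the delta here is a hand-written stand-in for whatever the Steerer and SteeringGate would produce; how they compute it is not described in this summary.

```python
def apply_steering(x, delta, lam=1.0):
    """Return z = x + lam * delta, elementwise over tokens and hidden dimensions.

    x and delta are lists of tokens, each token a list of floats, so the
    delta can differ per token and per dimension.
    """
    if len(x) != len(delta):
        raise ValueError("x and delta must have the same number of tokens")
    steered = []
    for token_x, token_delta in zip(x, delta):
        if len(token_x) != len(token_delta):
            raise ValueError("hidden sizes must match")
        steered.append([xi + lam * di for xi, di in zip(token_x, token_delta)])
    return steered


# Toy activations: 2 tokens, 3 hidden dimensions (values are made up).
x_l = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
# Hypothetical per-token, per-dimension delta standing in for the module output.
x_bar_l = [[0.5, 0.0, 0.0], [0.0, -1.0, 2.0]]

z_l = apply_steering(x_l, x_bar_l, lam=2.0)
print(z_l)
```

Setting `lam=0.0` returns the activations unchanged, which matches the reading of $\lambda$ as a steering-strength scale.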

Abstract

This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. The steering module adds learnable parameters amounting to only 0.14% of the original VLM's size. It gains model control through dimension-wise activation modulation and adaptive steering across layers, without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and offers a robust solution for multimodal model control through activation engineering.

Paper Structure

This paper contains 39 sections, 6 equations, 11 figures, 18 tables.

Figures (11)

  • Figure 1: SteerVLM overview. We introduce a layer-agnostic steering module that adjusts the model's output towards a target prompt and away from a converse prompt.
  • Figure 2: The steering module. The steering module is hooked directly after the multi-head attention module in each layer of the language decoder. It consists of the Steerer and the SteeringGate, which steer the activations based on the context vectors. The steered activation is added to the residual.
  • Figure 3: Attention mask for the Steerer's attention block. $i$ denotes the token at timestep 0 and $i+1$ denotes the token at timestep 1. The mask is boolean: 1 marks an unmasked token and 0 marks a masked token.
  • Figure 4: VNIA Dataset synthesis pipeline. We begin by generating target/converse prompt pairs. The prompts are then paired with images using CLIP-score matching with adaptive nucleus sampling for diversity. Finally, steered and unsteered responses are generated by Qwen2.5-VL-72B VLM.
  • Figure 5: Entropy threshold analysis of the trade-off between diversity and matching quality for image-prompt pairs.
  • ...and 6 more figures
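
The Figure 4 caption mentions pairing prompts with images via CLIP-score matching and nucleus sampling. The sketch below shows plain nucleus (top-p) sampling over a list of similarity scores, as a rough illustration of why sampling from a high-scoring subset yields more varied pairings than always taking the best match. Everything here is an assumption: the function names, the softmax over scores, and the fixed `top_p` are illustrative, and the adaptive rule the paper uses (possibly tied to the entropy threshold in Figure 5) is not described in this summary and is not reproduced.

```python
import math
import random


def softmax(scores):
    """Convert raw similarity scores to probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the smallest set of most probable items with mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


def sample_image(scores, top_p=0.9, rng=random):
    """Sample one candidate index from the nucleus, weighted by probability."""
    probs = softmax(scores)
    kept = nucleus(probs, top_p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Hypothetical image-prompt similarity scores for four candidate images.
scores = [0.31, 0.28, 0.12, 0.05]
choice = sample_image(scores, top_p=0.9, rng=random.Random(0))
```

A smaller `top_p` restricts sampling to the best-matching candidates, while a larger one admits weaker matches in exchange for diversity.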