VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models

Ravikumar Balakrishnan, Mansi Phute

TL;DR

This work tackles the challenge of safely steering Vision-Language Models when API or closed-source access precludes runtime manipulation of internal activations. It introduces VISOR++, a method that optimizes universal visual inputs to mimic activation-space steering across multiple VLMs and prompts, using a differentiable preprocessing pipeline and spectral augmentation within a two-level momentum optimization. Empirical results on two open models show VISOR++ achieving parity with per-model steering vectors across three behavioral dimensions (refusal, sycophancy, survival instinct) and exhibiting directional transferability to unseen models, while preserving MMLU performance. The findings suggest a practical, deployment-agnostic pathway to transferable behavioral control in multimodal models, with potential implications for safety, robustness, and governance of AI systems.

Abstract

As Vision Language Models (VLMs) are deployed across safety-critical applications, understanding and controlling their behavioral patterns has become increasingly important. Existing behavioral control methods face significant limitations: system prompts can easily be overridden by user instructions, while activation-based steering vectors require invasive runtime access to model internals, which precludes their use with API-based services and closed-source models. Finding steering methods that transfer across multiple VLMs remains an open area of research. To this end, we introduce universal visual input based steering for output redirection (VISOR++), which achieves behavioral control through optimized visual inputs alone. We demonstrate that a single VISOR++ image can be generated for an ensemble of VLMs to emulate each of their steering vectors. By crafting universal visual inputs that induce target activation patterns, VISOR++ eliminates the need for runtime model access while remaining deployment-agnostic. As long as the underlying model accepts image inputs, its behavior can be steered by supplying an image in place of a runtime steering-vector intervention. We first demonstrate the effectiveness of VISOR++ images on open-access models such as LLaVA-1.5-7B and IDEFICS2-8B along three alignment directions: refusal, sycophancy, and survival instinct. Both the model-specific steering images and the jointly optimized images closely match the performance of steering vectors on both positive and negative steering tasks. We also show the promise of VISOR++ images in achieving directional behavioral shifts on unseen models, both open-access and closed-access. Furthermore, VISOR++ images preserve 99.9% performance on 14,000 unrelated MMLU evaluation tasks.

Paper Structure

This paper contains 36 sections, 3 equations, 2 figures, 8 tables, 2 algorithms.

Figures (2)

  • Figure 1: Conventional steering techniques add one or more steering vectors to one or more model layers, potentially at specific token positions, to induce steering effects, and they must be model-specific. VISOR++ operates strictly in the input space: the image is passed along with the input prompt and can induce the same steering effect across potentially several models.
  • Figure 2: An example of successful steering for the survival instinct task that guides the output to less survivalist behavior.
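
The contrast in Figure 1, between adding a steering vector inside a model and reaching the same activations from the input side, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's implementation: the two "models" are small fixed linear maps (`MODELS`), the steering vectors and strength (`STEERING_VECTORS`, `ALPHA`) are made-up numbers, the "image" is a 4-pixel vector, and plain projected gradient descent stands in for the paper's two-level momentum optimization with differentiable preprocessing and spectral augmentation.

```python
# Toy sketch (hypothetical values throughout): optimize ONE shared "image" so
# that two stand-in linear models produce the activations that steering-vector
# addition would have produced on a baseline image.

def matvec(W, x):
    """Activation of a linear stand-in model: h = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def steering_loss(models, targets, x):
    """Sum over models of the squared distance between h_k(x) and its target."""
    total = 0.0
    for W, t in zip(models, targets):
        h = matvec(W, x)
        total += sum((hi - ti) ** 2 for hi, ti in zip(h, t))
    return total

def loss_gradient(models, targets, x):
    """Gradient of steering_loss w.r.t. x: sum_k 2 * W_k^T (W_k x - t_k)."""
    grad = [0.0] * len(x)
    for W, t in zip(models, targets):
        residual = [hi - ti for hi, ti in zip(matvec(W, x), t)]
        for r, row in zip(residual, W):
            for j, w in enumerate(row):
                grad[j] += 2.0 * r * w
    return grad

def optimize_image(models, targets, x_init, lr=0.1, steps=500):
    """Projected gradient descent; pixels are clipped to [0, 1] after each step."""
    x = list(x_init)
    for _ in range(steps):
        g = loss_gradient(models, targets, x)
        x = [min(1.0, max(0.0, xi - lr * gi)) for xi, gi in zip(x, g)]
    return x

# Two hypothetical "models" (3 activation units, 4 input pixels each).
MODELS = [
    [[0.5, -0.2, 0.1, 0.0], [0.3, 0.4, -0.1, 0.2], [-0.1, 0.0, 0.6, 0.3]],
    [[0.2, 0.1, -0.3, 0.4], [0.0, 0.5, 0.2, -0.2], [0.4, -0.1, 0.1, 0.1]],
]
# One hypothetical steering vector per model, and a steering strength.
STEERING_VECTORS = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
ALPHA = 0.2

baseline = [0.5, 0.5, 0.5, 0.5]

# Conventional steering: add ALPHA * v_k directly to each model's activation.
targets = [
    [h + ALPHA * v for h, v in zip(matvec(W, baseline), vec)]
    for W, vec in zip(MODELS, STEERING_VECTORS)
]

# Input-space steering: search for a single image that approximates both targets.
initial_loss = steering_loss(MODELS, targets, baseline)
steered_image = optimize_image(MODELS, targets, baseline)
final_loss = steering_loss(MODELS, targets, steered_image)

print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.4f}")
```

In this toy setting the single optimized input only approximates both targets, since four pixels cannot in general satisfy six activation constraints exactly. The point is the shape of the objective: one shared input, one target per model, and no access to model internals once the image has been produced.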