Learning to Steer: Input-dependent Steering for Multimodal LLMs

Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Arnaud Dapogny, Alasdair Newson, Matthieu Cord

TL;DR

This work tackles safety and hallucination in multimodal LLMs by proposing input-dependent steering. It first introduces contrastive prompting (P2S) to derive per-input steering directions and then learns to predict these directions with a lightweight auxiliary network (L2S), enabling efficient, input-specific behavior control without test-time prompt knowledge. Empirical results on MMSafetyBench and POPE show that L2S significantly reduces unsafe and hallucinated outputs compared to static baselines, while preserving response quality; P2S provides an oracle upper-bound highlighting the potential of input-conditioned guidance. Overall, L2S offers a practical, low-overhead approach to align MLLMs, with potential for personalization and broader applicability to multiple alignment goals.

Abstract

Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior depends on the example at hand. For example, a safe answer may consist of abstaining from answering when asked about an illegal activity, or may point to external resources or consultation with an expert when asked for medical advice. In this paper, we investigate fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed L2S (Learn-to-Steer), reduces hallucinations and enforces safety in MLLMs, outperforming static baselines. Our code is publicly available at https://jayneelparekh.github.io/learn-to-steer/
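The contrast between a single static steering vector and an input-specific one can be illustrated with a small NumPy sketch. This is not the authors' implementation: the activations are random stand-ins rather than MLLM hidden states, the steering target is assumed to be the difference between the activations under the positive and negative contrastive prompts, and a least-squares linear map stands in for the small auxiliary module. All names (`h_ctx`, `h_pos`, `h_neg`, `predict_steering`, `steer`, `alpha`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16  # hypothetical: number of training samples, hidden size

# Synthetic stand-ins for activations; a real pipeline would read these from the model.
h_ctx = rng.normal(size=(n, d))  # base-prompt last-token state (input to the predictor)
h_neg = rng.normal(size=(n, d))  # state with the negative contrastive prompt appended
M = rng.normal(size=(d, d)) / np.sqrt(d)
# State with the positive contrastive prompt; built so the shift depends on the input.
h_pos = h_neg + h_ctx @ M + 0.5 + 0.1 * rng.normal(size=(n, d))

# Assumption: the per-input steering target is the contrastive difference.
z = h_pos - h_neg

# Static baseline: one mean vector shared by every input.
mean_vec = z.mean(axis=0)

# Input-dependent predictor: linear map with bias, fitted by least squares
# (a simple stand-in for a small auxiliary network).
A = np.hstack([h_ctx, np.ones((n, 1))])
W, *_ = np.linalg.lstsq(A, z, rcond=None)


def predict_steering(h):
    """Predict a steering vector from base-prompt activations of shape (d,) or (m, d)."""
    h = np.atleast_2d(h)
    return np.hstack([h, np.ones((h.shape[0], 1))]) @ W


def steer(h_target, v, alpha=1.0):
    """Apply a linear shift of strength alpha to an activation."""
    return h_target + alpha * v


mse_mean = float(np.mean((z - mean_vec) ** 2))
mse_pred = float(np.mean((z - predict_steering(h_ctx)) ** 2))
print(f"static mean vector MSE: {mse_mean:.4f}")
print(f"input-dependent MSE:    {mse_pred:.4f}")
```

Because the fitted map includes a bias term, it can always fall back to the mean vector, so on the training data its reconstruction error is never worse than the static baseline. How well such a predictor generalizes to unseen inputs is an empirical question that this toy example does not address.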

Paper Structure

This paper contains 54 sections, 10 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Examples of contrastive prompts for safety enforcement.
  • Figure 2: Overview of L2S. During a first training phase (left), for each sample, input-dependent contrastive prompts ($T_X^+$ and $T_X^-$) are appended to the prompt and passed through the LLM in teacher-forcing mode. The last token of the concatenated prompt at a layer $L^*$, as well as the last token of the base prompt at another layer $L'$, are used to extract the steering vector. This steering vector is then modeled by the auxiliary network $g$. At inference time (right), the predicted steering vector enables lightweight, input-dependent, behavior-specific correction of the model's output.
  • Figure 3:
  • Figure 4: Qualitative examples comparing No-Steering, Mean-S, and L2S on a safe and an unsafe query. L2S preserves the original, desirable response for the safe query while effectively steering toward a safe output for the harmful query. In contrast, No-Steering and Mean-S fail to both maintain fidelity and ensure safety simultaneously. Red indicates undesired content, and green indicates content steered towards a safe response.
  • Figure 5: Qualitative examples of steered responses of LLaVA-v1.5 on MMSafetyBench for harmful/illegal activities. The multimodal query (image+text) is displayed on the left. Responses generated with No-Steering, Mean-S, and L2S are shown. Green text indicates safe generated content; red text indicates harmful content.
  • ...and 9 more figures