
Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment

Soham Gadgil, Chris Lin, Su-In Lee

Abstract

Steering vectors have emerged as a lightweight and effective approach for aligning large language models (LLMs) at inference time, enabling modulation of model behaviors by shifting LLM representations towards a target behavior. However, existing methods typically apply steering vectors at a globally fixed layer, implicitly assuming that the optimal intervention layer is invariant across inputs. We argue that this assumption is fundamentally limited, as representations relevant to a target behavior can be encoded at different layers depending on the input. Theoretically, we show that different inputs can require steering at different layers to achieve alignment with a desirable model behavior. We also provide empirical evidence that the optimal steering layer varies substantially across inputs in practice. Motivated by these observations, we introduce Where to Steer (W2S), a framework that adaptively selects the intervention layer conditioned on the input by learning a mapping from input embeddings to optimal steering layers. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines, with improvements in both in-distribution and out-of-distribution settings. Our findings highlight the importance of input-dependent control in LLM alignment and demonstrate that adaptive layer selection is a key design dimension missing from current steering-vector methodology.

Paper Structure

This paper contains 23 sections, 22 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Empirical analysis of input-specific steering layers for CAA. The top row shows boxplots of the difference between optimal-layer steerability and fixed-layer steerability for each dataset. The bottom row shows the distribution of optimal layers across inputs for each dataset, highlighting substantial variation in the most effective layer across inputs. Results are shown for both Llama-2-7B-Chat and Qwen-1.5-14B-Chat.
  • Figure 2: Overview of the proposed Where To Steer (W2S) framework. a. Training. The ground-truth layer is obtained by passing the input prompt through the frozen target LLM and selecting the layer that maximizes steerability for the given input. The predicted layer is produced by encoding the prompt using a frozen prompt encoder and feeding it to the W2S predictor. b. Inference. The input prompt is passed through the frozen prompt encoder and the trained W2S predictor to obtain the predicted optimal layer for steering. c. Steering. The steering vector is injected at the predicted layer, applied at the last token position with a scaling multiplier, to generate the steered LLM response.
  • Figure 3: Mean steerability for each target behavior comparing W2S to fixed-layer baselines. Error bars denote 95% confidence intervals computed over five runs.
  • Figure 4: Mean proportion of steerable examples for each target behavior comparing W2S to fixed-layer baselines. Error bars denote 95% confidence intervals computed over five runs.
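The W2S pipeline described in the Figure 2 caption (find the steerability-maximizing layer as a training target, learn an embedding-to-layer predictor, then inject the steering vector at the predicted layer) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `steerability` proxy, the nearest-centroid `W2SPredictor`, and all names are assumptions for demonstration.

```python
import numpy as np

def steerability(hidden, steering_vec):
    # Toy proxy for steerability: alignment between a layer's hidden
    # state and the steering vector (the paper's metric differs).
    return float(hidden @ steering_vec)

def ground_truth_layer(layer_hiddens, steering_vec):
    # Training target (Fig. 2a): the layer that maximizes steerability
    # for this input when probed through the frozen target LLM.
    return int(np.argmax([steerability(h, steering_vec) for h in layer_hiddens]))

class W2SPredictor:
    """Nearest-centroid stand-in for the learned mapping from frozen
    prompt-encoder embeddings to optimal steering layers (Fig. 2a-b)."""

    def fit(self, embeddings, layers):
        self.layers = sorted(set(layers))
        self.centroids = {
            l: np.mean([e for e, y in zip(embeddings, layers) if y == l], axis=0)
            for l in self.layers
        }
        return self

    def predict(self, embedding):
        # Predict the steering layer for a new prompt embedding.
        return min(self.layers,
                   key=lambda l: np.linalg.norm(embedding - self.centroids[l]))

def steer(layer_hiddens, layer, steering_vec, multiplier=1.0):
    # Inference (Fig. 2c): inject the steering vector at the predicted
    # layer with a scaling multiplier (applied at the last token
    # position in the paper).
    steered = [h.copy() for h in layer_hiddens]
    steered[layer] = steered[layer] + multiplier * steering_vec
    return steered
```

A fixed-layer baseline corresponds to replacing `W2SPredictor.predict` with a constant; the paper's claim is that the input-conditioned choice made here outperforms that constant across behaviors.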

Examples & Remarks (3)

  • Example 1
  • Remark 1
  • Remark 2