PIXEL: Adaptive Steering Via Position-wise Injection with eXact Estimated Levels under Subspace Calibration

Manjiang Yu, Hongji Li, Priyanka Singh, Xue Li, Di Wang, Lijie Hu

TL;DR

PIXEL introduces a tuning-free, position-wise activation steering framework for LLMs. It learns a robust, attribute-aligned subspace from dual views (tail-averaged and end-token) and derives a closed-form minimal intervention strength $\alpha^{*}_{\ell,t}(s)$ to achieve a target cosine similarity $s$, enabling per-position edits with minimal disruption. Orthogonal residual calibration further adapts the global direction to sample-specific semantics, while dynamic position scanning selects receptive injection sites. The approach provides representation-level guarantees on margins and generalization and demonstrates strong, transferable improvements across models and alignment tasks, while preserving general capabilities. Overall, PIXEL offers a principled, scalable method for reliable, multi-attribute controllable generation in LLMs.

Abstract

Reliable behavior control is central to deploying large language models (LLMs) on the web. Activation steering offers a tuning-free route to aligning attributes (e.g., truthfulness) that underpin trustworthy generation. Prevailing approaches rely on coarse heuristics and lack a principled account of where to steer and how strongly to intervene. To this end, we propose Position-wise Injection with eXact Estimated Levels (PIXEL), a position-wise activation steering framework that, in contrast to prior work, learns a property-aligned subspace from dual views (tail-averaged and end-token) and selects intervention strength via a constrained geometric objective with a closed-form solution, thereby adapting to token-level sensitivity without global hyperparameter tuning. PIXEL further performs sample-level orthogonal residual calibration to refine the global attribute direction and employs a lightweight position-scanning routine to identify receptive injection sites. We additionally provide representation-level guarantees for the minimal-intervention rule, supporting reliable alignment. Across diverse models and evaluation paradigms, PIXEL consistently improves attribute alignment while preserving the model's general capabilities, offering a practical and principled method for controllable generation with LLMs. Our code is available at https://github.com/V1centNevwake/PIXEL-Adaptive-Steering

Paper Structure

This paper contains 27 sections, 6 theorems, 27 equations, 3 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $\{x_i\}_{i=1}^n$ be i.i.d. evaluation samples drawn independently of any data used to learn the subspace/directions or to choose $(\mathcal{S}, t^\star, s)$. Then, for this fixed injection configuration, with probability at least $1-\delta$,

Figures (3)

  • Figure 1: Overview of PIXEL. (a) Dual-View Property-Aligned Subspace: Tail-averaged and end-token differentials from validated, property-aligned samples are projected via PCA to form a robust attribute-aligned subspace (Sec. \ref{sec:subspace}). (b) Adaptive Intervention Strength: A closed-form, per-position minimum steering strength is derived to avoid global hyperparameter tuning (Sec. \ref{sec:adaptive_alpha}). (c) Orthogonal Residual Calibration: The global attribute direction is refined by a sample-specific orthogonal residual, enabling context-aware alignment while maintaining global consistency (Sec. \ref{sec:adaptive_alpha}).
  • Figure 2: Ablation on Property Subspace Views. (Single vs. Dual)
  • Figure 3: Analysis of $z_{\text{target}}$ sensitivity in PIXEL.

Theorems & Definitions (10)

  • Theorem 1: Generalization of the normalized averaged margin
  • Theorem 2: Nonnegativity under per-site minimal intervention
  • Lemma 1: Hoeffding’s inequality
  • Lemma 2: Monotonicity of directional cosine under positive shift
  • Lemma 3: Closed-form minimal strength for $\cos(h+\alpha w,w)\ge s$
  • proof
  • Lemma 4: Boundedness of the normalized margin
  • proof
  • proof
  • proof
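
Lemma 3 concerns the smallest strength $\alpha$ for which $\cos(h+\alpha w,w)\ge s$. This listing does not reproduce the lemma's formula, so the following is only a sketch derived directly from that constraint, not the authors' implementation. It assumes $w$ is rescaled to unit length, $0 \le s < 1$, and $\alpha \ge 0$. Writing $a = \langle h, w \rangle$ and $r = \|h - a w\|$ with $r > 0$, the cosine equals $(a+\alpha)/\sqrt{(a+\alpha)^2 + r^2}$, which increases with $\alpha$ (compare Lemma 2) and reaches $s$ at $\alpha = s\,r/\sqrt{1-s^2} - a$, clipped at zero. The function names `minimal_alpha`, `steer`, and `cosine` are hypothetical.

```python
import math


def _dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two nonzero vectors given as lists."""
    return _dot(u, v) / (math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v)))


def minimal_alpha(h, w, s):
    """Smallest alpha >= 0 such that cos(h + alpha * w_hat, w_hat) >= s.

    Assumptions (not taken from the paper): w_hat is w rescaled to unit
    length, 0 <= s < 1, and h is not a non-positive multiple of w (the
    degenerate case where the minimum is not attained).
    """
    if not 0.0 <= s < 1.0:
        raise ValueError("target cosine s must lie in [0, 1)")
    w_norm = math.sqrt(_dot(w, w))
    if w_norm == 0.0:
        raise ValueError("steering direction w must be nonzero")
    w_hat = [wi / w_norm for wi in w]
    a = _dot(h, w_hat)  # component of h along the direction
    r = math.sqrt(sum((hi - a * wi) ** 2 for hi, wi in zip(h, w_hat)))
    needed = s * r / math.sqrt(1.0 - s * s)  # required projection onto w_hat
    return max(0.0, needed - a)  # no edit if already aligned enough


def steer(h, w, s):
    """Apply the minimal shift along the unit direction of w."""
    w_norm = math.sqrt(_dot(w, w))
    alpha = minimal_alpha(h, w, s)
    return [hi + alpha * wi / w_norm for hi, wi in zip(h, w)]


if __name__ == "__main__":
    h, w, s = [0.0, 1.0], [1.0, 0.0], 0.6
    print(minimal_alpha(h, w, s), cosine(steer(h, w, s), w))
```

For $h = (0, 1)$, $w = (1, 0)$, and $s = 0.6$, the rule gives $\alpha \approx 0.75$, and the steered vector meets the target cosine up to floating-point error. A hidden state that already satisfies the constraint receives $\alpha = 0$, which is how a minimal-intervention rule of this kind leaves well-aligned positions untouched.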