Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits

Jia Qing Yap

Abstract

We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space. This bypasses the SAE's top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen's d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has no detectable effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
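The probe-then-decode step described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is a toy stand-in rather than the paper's implementation: the SAE weights `W_enc` and `W_dec` are random instead of trained, the names `topk_encode`, `ridge_probe`, and `steering_vector` are hypothetical, and both the unit-normalization of the vector and the additive form of the intervention are assumptions the abstract does not confirm.

```python
import numpy as np

def topk_encode(X, W_enc, k):
    """Toy top-k SAE encoder: ReLU pre-activations, keep the k largest per row."""
    pre = np.maximum(X @ W_enc, 0.0)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    Z = np.zeros_like(pre)
    np.put_along_axis(Z, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    return Z

def ridge_probe(Z, y, lam=1.0):
    """Closed-form ridge regression on centered data: w = (Z'Z + lam*I)^-1 Z'y."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Zc.T @ Zc + lam * np.eye(Z.shape[1]), Zc.T @ yc)

def steering_vector(w, W_dec):
    """Project probe weights (n_latents,) through the decoder (n_latents, d_model)."""
    v = w @ W_dec
    return v / np.linalg.norm(v)  # unit norm is an assumption, not from the paper

# Hypothetical toy data: contrastive activations with a planted trait direction.
rng = np.random.default_rng(0)
d_model, n_latents, k, n = 16, 64, 8, 200
W_enc = rng.normal(size=(d_model, n_latents))  # random stand-in for a trained SAE
W_dec = rng.normal(size=(n_latents, d_model))
y = np.repeat([1.0, 0.0], n // 2)              # 1 = trait present, 0 = absent
trait_dir = rng.normal(size=d_model)
X = rng.normal(size=(n, d_model)) + np.outer(y, trait_dir)

Z = topk_encode(X, W_enc, k)     # sparse latent codes
w = ridge_probe(Z, y)            # discriminative direction in latent space
v = steering_vector(w, W_dec)    # dense direction in residual-stream space

# Steering adds a scaled copy of v to a residual-stream activation h.
h = rng.normal(size=d_model)
h_steered = h + 2.0 * v          # multiplier 2; the paper's scaling may differ
```

Because `v` lives in the dense activation space, the multiplier can be varied continuously, whereas an intervention on the sparse codes `Z` would only affect latents that survive the top-k selection.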

Paper Structure (50 sections, 5 equations, 3 figures, 10 tables)

Figures (3)

  • Figure 1: SAE-decoded probe steering pipeline. Contrastive activations are encoded through the SAE, a ridge regression probe identifies the discriminative direction in SAE latent space, and the probe weights are projected through the decoder to obtain a continuous steering vector in the model's native activation space.
  • Figure 2: Dose-response curves for proactive tool-call effect size $d(\text{pro})$ as a function of steering multiplier $\alpha$. Autonomy exhibits a smooth inverted-U with a clear therapeutic window. Tool-use eagerness shows a phase transition: intermediate multipliers degrade performance before a high-$\alpha$ regime produces a new behavioral mode. Persistence and risk calibration show monotonic suppression at all multipliers.
  • Figure 3: Cross-trait specificity matrix (Cohen's $d$, $\alpha = 2.0$, all positions). Every steering vector primarily increases autonomy and decreases deference, revealing a dominant agency axis. No vector achieves a specificity ratio $> 1.0$. Numeric values in Table \ref{tab:cross_trait}.
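
The effect sizes reported in these figures are Cohen's $d$ values. As a reference point, the sketch below computes the common pooled-standard-deviation form of the statistic. Whether the paper uses this exact variant is an assumption, and the function name `cohens_d` and the sample numbers are illustrative only.

```python
import numpy as np

def cohens_d(treated, control):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    n1, n2 = len(treated), len(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * treated.var(ddof=1)
                  + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
    return (treated.mean() - control.mean()) / np.sqrt(pooled_var)

# Hypothetical per-rollout counts of proactive tool calls, steered vs. baseline.
steered = [2, 4, 6]
baseline = [1, 3, 5]
d = cohens_d(steered, baseline)  # means 4 and 3, pooled SD 2, so d = 0.5
```

A positive $d$ means the steered condition produced more of the measured behavior than the baseline, and swapping the two arguments flips the sign.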