Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits

Jia Qing Yap

Abstract

We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space. This bypasses the SAE's top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen's d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has zero effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
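
The probe-then-decode step described above can be illustrated with a small NumPy sketch. It is not the paper's implementation: the weight layout (`W_enc`, `b_enc`, `W_dec`), the closed-form ridge solver, the unit normalization of the steering vector, and the additive update `h + alpha * v` are all assumptions chosen for illustration, and the data are synthetic.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, k):
    """Top-k SAE encoder (hypothetical layout): keep the k largest pre-activations per row."""
    pre = x @ W_enc + b_enc
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]  # indices of the k largest entries per row
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)  # ReLU on the surviving latents
    return z

def ridge_probe(z, y, lam=1.0):
    """Closed-form ridge regression weights on centered SAE latents."""
    zc = z - z.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(zc.T @ zc + lam * np.eye(z.shape[1]), zc.T @ yc)

def steering_vector(w_probe, W_dec):
    """Project probe weights through the decoder into activation space.

    Unit normalization is an assumption made here, not something the abstract states.
    """
    v = w_probe @ W_dec  # (n_latents,) @ (n_latents, d_model) -> (d_model,)
    return v / np.linalg.norm(v)

def steer(h, v, alpha):
    """Add the scaled steering vector to residual-stream activations (assumed update rule)."""
    return h + alpha * v

# Synthetic contrastive data: two classes separated along a random direction.
rng = np.random.default_rng(0)
d_model, n_latents, k, n = 16, 64, 8, 200
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
y = np.repeat([1.0, -1.0], n // 2)
x = rng.normal(size=(n, d_model)) + y[:, None] * rng.normal(size=d_model)

z = sae_encode(x, W_enc, b_enc, k)
w = ridge_probe(z, y)
v = steering_vector(w, W_dec)
h_steered = steer(x[:1], v, alpha=2.0)
```

Because `v` lives in the model's activation space, the multiplier `alpha` can take any real value; no top-k selection is applied at steering time, which is the sense in which the projection sidesteps the SAE's discretization.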

Paper Structure (50 sections, 5 equations, 3 figures, 10 tables)

Introduction
Related Work
Sparse Autoencoders for Interpretability.
Activation Steering.
Linear Probing and Causal Intervention.
Hybrid and Linear-Recurrence Architectures.
Model and Architecture
Method
SAE Training
Contrastive Feature Identification
Probe-to-Residual-Stream Projection
Step 1: Linear Probe.
Step 2: Decoder Projection.
Why not steer inside the SAE?
Steering Application.
...and 35 more sections

Figures (3)

Figure 1: SAE-decoded probe steering pipeline. Contrastive activations are encoded through the SAE, a ridge regression probe identifies the discriminative direction in SAE latent space, and the probe weights are projected through the decoder to obtain a continuous steering vector in the model's native activation space.
Figure 2: Dose-response curves for proactive tool-call effect size $d(\text{pro})$ as a function of steering multiplier $\alpha$. Autonomy exhibits a smooth inverted-U with a clear therapeutic window. Tool-use eagerness shows a phase transition: intermediate multipliers degrade performance before a high-$\alpha$ regime produces a new behavioral mode. Persistence and risk calibration show monotonic suppression at all multipliers.
Figure 3: Cross-trait specificity matrix (Cohen's $d$, $\alpha = 2.0$, all positions). Every steering vector primarily increases autonomy and decreases deference, revealing a dominant agency axis. No vector achieves a specificity ratio $> 1.0$. Numeric values in Table \ref{tab:cross_trait}.
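
The effect sizes in Figures 2 and 3 are reported as Cohen's $d$. As a reminder of what that statistic measures, here is a minimal sketch using the pooled-standard-deviation form. Whether the paper uses this exact variant is an assumption, and the per-rollout counts below are made-up illustrative numbers, not results from the paper.

```python
import numpy as np

def cohens_d(treated, control):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    n1, n2 = len(treated), len(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * treated.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
    return (treated.mean() - control.mean()) / np.sqrt(pooled_var)

# Hypothetical proactive tool-call counts per rollout, steered vs. unsteered.
steered = [2, 3, 4]
baseline = [1, 2, 3]
d = cohens_d(steered, baseline)
```

A positive value means the steered condition has the higher mean; swapping the two arguments flips the sign.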
