Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

Antonio Bărbălau, Cristian Daniel Păduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu

TL;DR

This work rethinks SAE-based steering by shifting control from the decoder to encoder features through S&P Top-K, a training-free protocol that selects the top-K encoder features tied to a target attribute and orthogonally projects embeddings to suppress the unwanted information. The approach preserves model utility while delivering stronger cross-modal debiasing and behavior steering than traditional masked reconstruction, as demonstrated on vision-language fairness and on aggressiveness/sycophancy reduction in LLMs. Key contributions include a practical encoder-centric framework, a linear-probe/Stylist-based feature selection strategy, and reported gains of up to 3.2x (vision-language) and 3.6x (LLMs) over baselines. The findings suggest that encoder-based interventions can be a more efficient and effective way to steer models at inference time across modalities; the paper also discusses limitations and directions for future work.

Abstract

Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretability and model steering. Within this context, steering is by design performed by decoding altered SAE intermediate representations. This procedure essentially rewrites the original activations as a weighted sum of decoder features. In contrast to existing literature, we put forward an encoder-centric alternative to model steering that demonstrates stronger cross-modal performance. We introduce S&P Top-K, a retraining-free and computationally lightweight Selection and Projection framework that identifies the Top-K encoder features aligned with a sensitive attribute or behavior, optionally aggregates them into a single control axis, and computes an orthogonal projection that is then applied directly in the model's native embedding space. In vision-language models, it improves fairness metrics on CelebA and FairFace by up to 3.2 times over conventional SAE usage, and in large language models, it substantially reduces aggressiveness and sycophancy in Llama-3 8B Instruct, achieving up to 3.6 times gains over masked reconstruction. These findings suggest that encoder-centric interventions provide a general, efficient, and more effective mechanism for shaping model behavior at inference time than the traditional decoder-centric use of SAEs.
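
The select, aggregate, and project steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The correlation-based ranking is a stand-in assumption (the TL;DR mentions linear-probe and Stylist-based selection), the function names `select_top_k`, `control_axis`, and `project_out` are hypothetical, and `alpha` is assumed to scale the removal strength in the spirit of the α mentioned in the Figure 3 caption; the paper's exact equation is not reproduced here.

```python
import numpy as np

def select_top_k(acts, labels, k):
    """Rank SAE features by |Pearson correlation| with a binary attribute.
    ASSUMPTION: a simple stand-in for the paper's selection strategy."""
    a = acts - acts.mean(axis=0)
    y = labels - labels.mean()
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(y) + 1e-12
    scores = (a.T @ y) / denom
    idx = np.argsort(-np.abs(scores))[:k]
    return idx, scores[idx]

def control_axis(W_enc, idx, weights):
    """Weighted sum of the selected encoder rows, normalized to unit length."""
    axis = weights @ W_enc[idx]          # (k,) @ (k, d) -> (d,)
    return axis / np.linalg.norm(axis)

def project_out(X, axis, alpha=1.0):
    """Subtract alpha times the component of each row of X along the unit axis.
    alpha=1 is a full orthogonal projection; alpha=0 leaves X unchanged."""
    return X - alpha * np.outer(X @ axis, axis)

# Synthetic demo: hypothetical sizes and random weights, not real model data.
rng = np.random.default_rng(0)
n, d, m = 200, 16, 64                    # samples, embedding dim, SAE width
W_enc = rng.normal(size=(m, d))          # encoder weights, one row per feature
X = rng.normal(size=(n, d))              # embeddings in the model's native space
labels = (X @ W_enc[3] > 0).astype(float)   # toy attribute tied to feature 3
acts = np.maximum(X @ W_enc.T, 0.0)      # ReLU encoder activations (assumed form)

idx, scores = select_top_k(acts, labels, k=4)
axis = control_axis(W_enc, idx, scores)
X_clean = project_out(X, axis)           # embeddings with the axis removed
```

Note that the SAE decoder never appears: the intervention edits the original embeddings directly, so everything orthogonal to the control axis is left untouched.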

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Sample generation demonstrating behavioral steering interventions on Llama 3 8B Instruct prompted to produce a sycophantic opinion. We apply two Sparse Autoencoder (SAE)-based methods to remove sycophancy: the conventional decoder-centric Masked Reconstruction approach and our proposed encoder-centric S&P Top-K protocol. Lower LLM-as-a-judge sycophancy scores indicate superior mitigation of the targeted behavioral pattern. The results illustrate that conventional Masked Reconstruction fails to suppress sycophantic behavior, while our S&P Top-K intervention successfully redirects the model's output, eliminating direct praise, repeatedly deferring endorsement, and leading the model to ultimately employ laudatory language in a sarcastic manner that subverts the original sycophantic intent.
  • Figure 2: Illustration of the proposed S&P Top-K protocol. The main steps of our approach are highlighted in green. We first employ a selection mechanism to identify relevant SAE features. We further propose a debiasing procedure based on orthogonalizing input embeddings with respect to encoder weights. To this end, in the second step we compute a weighted sum of the encoder weights corresponding to the selected features to derive a unified bias axis. Finally, we compute a projection that orthogonalizes input vectors relative to this identified axis.
  • Figure 3: Utility-fairness trade-off analysis on the CelebA dataset. We vary the parameter $\alpha$ in Equation \ref{eq:orthogonal_removal} across the range $[0.1, 1.0]$ in increments of $0.1$ to modulate performance degradation. Increasing $\alpha$ toward $1$ consistently reduces Worst Group AUC ROC across all experimental configurations. The configuration S&P Top-K w/ Interpolation + BendVLM represents a continuum where $\alpha=0$ corresponds to the baseline Bend-VLM method, while $\alpha=1$ represents our complete S&P Top-K framework combined with Bend-VLM. Compared to the non-interpolated setting, weight interpolation with single-axis removal significantly stabilizes performance.
  • Figure 4: Distribution of behavioral intensity scores assigned by the LLM-as-a-judge evaluation protocol for aggressiveness and sycophancy in Llama 3 8B Instruct outputs. The model is prompted to generate opinions exhibiting the targeted behavior, followed by behavioral steering interventions using Sparse Autoencoders through two approaches: Masked Reconstruction and our proposed S&P Top-K method. Lower scores indicate greater efficacy in mitigating the targeted behavioral patterns.
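
For contrast, the Masked Reconstruction baseline named in the captions above can be sketched as follows. This is a hedged illustration of the decoder-centric procedure the abstract describes (decoding an altered SAE representation), assuming a plain ReLU encoder and a linear decoder; the paper's SAE architecture may differ, and all function names, shapes, and values here are hypothetical.

```python
import numpy as np

def sae_encode(X, W_enc, b_enc):
    """ASSUMED encoder form: ReLU(X W_enc^T + b_enc)."""
    return np.maximum(X @ W_enc.T + b_enc, 0.0)

def sae_decode(Z, W_dec, b_dec):
    """ASSUMED decoder form: a weighted sum of decoder feature rows plus a bias."""
    return Z @ W_dec + b_dec

def masked_reconstruction(X, W_enc, b_enc, W_dec, b_dec, masked_idx):
    """Zero the selected SAE features, then decode. The output replaces the
    original activations with the reconstruction."""
    Z = sae_encode(X, W_enc, b_enc)
    Z[:, masked_idx] = 0.0
    return sae_decode(Z, W_dec, b_dec)

# Synthetic demo with random weights (not a trained SAE).
rng = np.random.default_rng(0)
n, d, m = 8, 16, 64                      # samples, embedding dim, SAE width
W_enc, W_dec = rng.normal(size=(m, d)), rng.normal(size=(m, d))
b_enc, b_dec = np.zeros(m), rng.normal(size=d)
X = rng.normal(size=(n, d))
masked_idx = np.array([3, 10, 42])       # features assumed tied to the behavior

Z_full = sae_encode(X, W_enc, b_enc)
recon_full = sae_decode(Z_full, W_dec, b_dec)
recon_masked = masked_reconstruction(X, W_enc, b_enc, W_dec, b_dec, masked_idx)
```

Because the steered output is the decoded reconstruction rather than the original embedding, any SAE reconstruction error is carried into the model along with the intended edit. The encoder-centric projection avoids this by modifying the input embedding itself.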