SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du

TL;DR

This work investigates the mechanisms of instruction following in large language models by introducing SAIF, a framework that uses sparse autoencoders to identify interpretable latent features tied to instruction adherence. It samples diverse linguistic variants of instructions, computes steering vectors from SAE latents, and applies calibrated residual-stream adjustments to steer outputs toward instruction compliance. Key findings show that instruction following is encoded by multiple, carefully chosen SAE latents, with the last Transformer layer playing a crucial role and post-instruction positioning enhancing steering effectiveness; the approach scales across model sizes and instruction types. The proposed method offers a lightweight, explainable means to steer LLMs and provides mechanistic insights that can inform future alignment and controllability research.

Abstract

The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders (SAEs) to interpret how instruction following works in these models. We demonstrate how the features we identify can effectively steer model outputs to align with given instructions. Through analysis of SAE latent activations, we identify specific latents responsible for instruction-following behavior. Our findings reveal that instruction-following capabilities are encoded by a distinct set of instruction-relevant SAE latents. These latents both show semantic proximity to relevant instructions and demonstrate causal effects on model behavior. Our research highlights several crucial factors for achieving effective steering performance: precise feature identification, the role of the final layer, and optimal instruction positioning. Additionally, we demonstrate that our methodology scales effectively across SAEs and LLMs of varying sizes.
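
The steering idea summarized above (compare SAE latent activations with and without an instruction, keep a few instruction-relevant latents, and shift the residual stream along their decoder directions) can be illustrated with a small NumPy sketch. This is not the paper's implementation: the ReLU encoder, the mean-difference selection rule, the top-k choice, the scaling factor alpha, and all names (sae_encode, steering_vector, apply_steering, W_enc, W_dec) are assumptions made for illustration only.

```python
import numpy as np


def sae_encode(resid, W_enc, b_enc):
    """Assumed ReLU SAE encoder: (n, d_model) residuals -> (n, d_sae) latents."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)


def steering_vector(resid_with, resid_without, W_enc, b_enc, W_dec, k=2):
    """Hypothetical selection rule: keep the k latents whose mean activation
    rises most when the instruction is present, then combine their decoder
    directions, weighted by that rise, into one residual-space vector."""
    z_with = sae_encode(resid_with, W_enc, b_enc).mean(axis=0)
    z_without = sae_encode(resid_without, W_enc, b_enc).mean(axis=0)
    diff = z_with - z_without
    top_k = np.argsort(diff)[-k:][::-1]  # indices of the k largest increases
    vec = diff[top_k] @ W_dec[top_k]     # weighted sum of decoder rows
    return vec, top_k


def apply_steering(resid, vec, alpha=1.0):
    """Add the scaled steering vector to a residual-stream activation."""
    return resid + alpha * vec


# Toy setup (illustrative only): identity encoder/decoder, 4 dimensions.
W_enc = np.eye(4)
b_enc = np.zeros(4)
W_dec = np.eye(4)

# Residual activations for prompts without and with the instruction.
resid_without = np.array([[1.0, 0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0, 0.0]])
resid_with = np.array([[1.0, 2.0, 0.0, 0.0],
                       [1.0, 4.0, 0.0, 1.0]])

vec, top_k = steering_vector(resid_with, resid_without, W_enc, b_enc, W_dec, k=2)
steered = apply_steering(np.array([1.0, 0.0, 0.0, 0.0]), vec, alpha=2.0)
print(top_k, vec, steered)
```

In this toy setup the two selected latents are the ones that activate only when the instruction is present, and alpha plays the role of the calibration strength. With a real model, the residual activations would come from a chosen Transformer layer and the weights from a trained SAE.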

Paper Structure

This paper contains 27 sections, 6 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: The proposed SAIF framework. The model computes steering vectors from SAE latent differences to guide outputs according to instructions. (a) Extract steering vector. (b) Apply steering for controlled output.
  • Figure 2: Comparison of feature activation patterns between pre-instruction and post-instruction conditions across different SAE latent dimensions. The plots show three key metrics: activation strength (left), feature stability (middle), and activation probability (right) for eight identified instruction-relevant features.
  • Figure 3: The impact of the number of latent dimensions (k) on our steering experiments. The x-axis represents different values of k, while the y-axis records the accuracy. We track the trend of strict accuracy (SA) and loose accuracy (LA) across 8 different k values.
  • Figure 4: Examples of French translation task outcomes showing strict instruction following and loose instruction following using inputs in different languages. (Gemma-2-2b-it, SAE dimension of 65K)
  • Figure 5: Performance comparison between original model outputs and two steering approaches across different instruction types on Gemma-2-2b-it and Gemma-2-9b-it models. Results show the accuracy percentages for translation tasks (French, Chinese, English), keyword inclusion, and summarization tasks.
  • ...and 6 more figures