SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

Zirui He; Haiyan Zhao; Yiran Qiao; Fan Yang; Ali Payani; Jing Ma; Mengnan Du

TL;DR

This work investigates how large language models follow instructions by introducing SAIF, a framework that uses sparse autoencoders (SAEs) to identify interpretable latent features tied to instruction adherence. SAIF samples diverse linguistic variants of each instruction, computes steering vectors from the SAE latents, and applies calibrated residual-stream adjustments to steer outputs toward instruction compliance. The key findings are that instruction following is encoded by a set of several SAE latents that must be identified precisely, that the last Transformer layer plays a crucial role, and that post-instruction positioning improves steering effectiveness. The approach scales across model sizes and instruction types. The method offers a lightweight, explainable way to steer LLMs and provides mechanistic insights that can inform future alignment and controllability research.

Abstract

The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders (SAEs) to interpret how instruction following works in these models. We demonstrate how the features we identify can effectively steer model outputs to align with given instructions. Through analysis of SAE latent activations, we identify specific latents responsible for instruction-following behavior. Our findings reveal that instruction-following capabilities are encoded by a distinct set of instruction-relevant SAE latents. These latents both show semantic proximity to relevant instructions and demonstrate causal effects on model behavior. Our research highlights several crucial factors for achieving effective steering performance: precise feature identification, the role of the final layer, and optimal instruction positioning. Additionally, we demonstrate that our methodology scales effectively across SAEs and LLMs of varying sizes.
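
The steering idea summarized above can be illustrated with a toy sketch. It is not the paper's implementation: the selection rule (rank latents by how many instruction variants activate them), the weighting (mean activation), the scale `alpha`, and all function names and numbers are assumptions chosen only to make the idea concrete. It uses plain Python lists in place of real model activations.

```python
from typing import List, Sequence

Vector = List[float]


def select_latents(acts: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the k latents that fire on the most instruction variants.

    acts[v][j] is the (hypothetical) activation of SAE latent j on variant v.
    Ties are broken by the lower latent index.
    """
    n_latents = len(acts[0])
    freq = [sum(1 for row in acts if row[j] > 0.0) for j in range(n_latents)]
    return sorted(range(n_latents), key=lambda j: (-freq[j], j))[:k]


def steering_vector(decoder: Sequence[Sequence[float]],
                    latents: Sequence[int],
                    weights: Sequence[float]) -> Vector:
    """Weighted sum of the decoder directions of the selected latents."""
    vec = [0.0] * len(decoder[0])
    for j, w in zip(latents, weights):
        for i, d in enumerate(decoder[j]):
            vec[i] += w * d
    return vec


def steer(residual: Vector, vec: Vector, alpha: float) -> Vector:
    """Add the scaled steering vector to one residual-stream state."""
    return [h + alpha * v for h, v in zip(residual, vec)]


# Toy data (assumed): 3 instruction variants, 4 SAE latents, model width 3.
acts = [
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 1.5, 0.3, 0.0],
    [0.0, 2.5, 0.0, 0.5],
]
decoder = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

selected = select_latents(acts, k=2)
# Assumed weighting: mean activation of each selected latent over variants.
weights = [sum(row[j] for row in acts) / len(acts) for j in selected]
vec = steering_vector(decoder, selected, weights)
steered = steer([1.0, 1.0, 1.0], vec, alpha=2.0)
```

In a real setting the activations would come from an SAE attached to a chosen layer of the model, and the adjustment would be applied to the residual stream during generation. How the latents are chosen and how the adjustment is calibrated are the parts the abstract identifies as critical, so the simple rules used here should be read as placeholders.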
