Evolution of SAE Features Across Layers in LLMs

Daniel Balcells; Benjamin Lerner; Michael Oesterle; Ediz Ucar; Stefan Heimersheim

TL;DR

This work analyzes statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. It finds that a considerable number of features are passed through from a previous layer, that some features can be expressed as quasi-boolean combinations of previous features, and that some features become more specialized in later layers.

Abstract

Sparse Autoencoders (SAEs) for transformer-based language models are typically defined independently per layer. In this work we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors (https://stefanhex.com/spar-2024/feature-browser/), and build communities of related features across layers. We find that a considerable number of features are passed through from a previous layer, that some features can be expressed as quasi-boolean combinations of previous features, and that some features become more specialized in later layers.
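To make the idea of "most similar next-layer neighbors" concrete, here is a minimal sketch that ranks later-layer features by how strongly their activations co-vary with an earlier-layer feature. The abstract does not name the similarity measure used, so Pearson correlation is an assumption here. The function names (pearson, top_next_layer_neighbors), the feature labels, and the toy activation values are all hypothetical and are not taken from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length activation sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant (e.g. never-firing) feature has no defined correlation.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def top_next_layer_neighbors(acts_prev, acts_next, k=2):
    """For each earlier-layer feature, rank later-layer features by correlation.

    acts_prev, acts_next: dicts mapping a feature name to a list of
    activations, one per token, measured over the same tokens.
    """
    neighbors = {}
    for name_prev, x in acts_prev.items():
        scored = [(pearson(x, y), name_next) for name_next, y in acts_next.items()]
        scored.sort(reverse=True)  # highest correlation first
        neighbors[name_prev] = [(name, round(score, 3)) for score, name in scored[:k]]
    return neighbors

# Toy activations over six tokens (made-up numbers, not data from the paper).
layer_a = {"a0": [0, 2, 0, 3, 0, 1], "a1": [1, 0, 0, 0, 2, 0]}
layer_b = {
    "b0": [0, 2, 0, 3, 0, 1],  # identical to a0: behaves like a passed-through feature
    "b1": [1, 0, 0, 0, 0, 0],  # fires on a subset of a1's tokens: a more specialized feature
    "b2": [0, 0, 1, 0, 0, 0],  # fires on a token where neither earlier feature is active
}
result = top_next_layer_neighbors(layer_a, layer_b, k=2)
```

In this toy setup, a0's top neighbor is b0 with correlation 1.0, which is what a passed-through feature would look like under this measure, and a1's top neighbor is b1 with a weaker positive correlation, consistent with b1 firing on only part of a1's tokens. With real SAE activations the same ranking would be computed over many tokens and far more features, typically with vectorized code.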
