Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Julian Minder; Clément Dumas; Caden Juang; Bilal Chugtai; Neel Nanda

TL;DR

This work probes how fine-tuning reshapes model representations by using crosscoders to decompose base and chat activations into a shared sparse dictionary of latents with model-specific decoders. It identifies two artifacts of the L1 training objective, Complete Shrinkage and Latent Decoupling, that can falsely label concepts present in the base model as chat-specific, and introduces Latent Scaling as a diagnostic tool. The authors show that BatchTopK crosscoders substantially mitigate these issues, yielding more genuinely chat-specific, interpretable latents and enabling more reliable assessments of their causal effect on chat behavior. They find interpretable latents for concepts such as refusals, false information detection, and personal questions, and demonstrate that template tokens play a central role in driving chat-specific behavior. Overall, the work advances best practices for crosscoder-based model diffing and provides concrete insights into how chat-tuning alters model behavior, with implications for safety and alignment.

Abstract

Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues that stem from the crosscoder's L1 training loss and can misattribute concepts as unique to the fine-tuned model when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as $\textit{false information}$ and $\textit{personal question}$, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat-tuning modifies model behavior.
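
To make the shared-dictionary idea concrete, the following is a minimal NumPy sketch of a crosscoder forward pass with a BatchTopK-style activation. It is an illustration under stated assumptions, not the authors' implementation: the function names (`batch_topk`, `crosscoder_forward`), the single encoder over concatenated activations, and all shapes are hypothetical choices made for this example. It shows one set of sparse latent activations being decoded through two model-specific decoders, with sparsity enforced by keeping only the largest activations across the whole batch rather than by an L1 penalty.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * batch_size largest activations over the whole batch.

    Assumption: this is a simplified reading of "BatchTopK"; exact ties at
    the threshold may keep slightly more than k * batch_size entries.
    """
    acts = np.maximum(pre_acts, 0.0)  # ReLU, so kept activations are non-negative
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Value of the n_keep-th largest entry, used as a batch-wide threshold.
    cut = flat.size - n_keep
    thresh = np.partition(flat, cut)[cut]
    return np.where(acts >= thresh, acts, 0.0)


def crosscoder_forward(x_base, x_chat, W_enc, b_enc, D_base, D_chat, k):
    """Encode both models' activations into shared latents, decode per model.

    x_base, x_chat: (batch, d) activations from the base and chat models.
    W_enc: (2 * d, n_latents) hypothetical encoder over the concatenation.
    D_base, D_chat: (n_latents, d) model-specific decoders.
    """
    x = np.concatenate([x_base, x_chat], axis=1)
    latents = batch_topk(x @ W_enc + b_enc, k)  # shared sparse code
    return latents, latents @ D_base, latents @ D_chat


# Toy usage with random weights (no trained model is involved).
rng = np.random.default_rng(0)
batch, d, n_latents, k = 8, 16, 32, 4
x_base = rng.normal(size=(batch, d))
x_chat = rng.normal(size=(batch, d))
W_enc = rng.normal(size=(2 * d, n_latents))
b_enc = np.zeros(n_latents)
D_base = rng.normal(size=(n_latents, d))
D_chat = rng.normal(size=(n_latents, d))

latents, recon_base, recon_chat = crosscoder_forward(
    x_base, x_chat, W_enc, b_enc, D_base, D_chat, k
)
```

In this framing, a latent whose row in `D_base` has near-zero norm would be read as having no direction in the base model. The abstract's point is that such a reading can be an artifact of the L1 training loss, so a small norm alone should not be taken as evidence that the concept is chat-specific.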
