Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Julian Minder, Clément Dumas, Caden Juang, Bilal Chugtai, Neel Nanda

TL;DR

This work probes how fine-tuning reshapes model representations by using crosscoders to decompose base and chat activations into a shared sparse dictionary of latents with model-specific decoders. It identifies two artifacts of the L1 training objective, Complete Shrinkage and Latent Decoupling, that can falsely label base-relevant concepts as chat-specific, and introduces Latent Scaling as a diagnostic tool. The authors show that BatchTopK crosscoders substantially mitigate these issues, yielding more genuinely chat-specific, interpretable latents and enabling more reliable assessments of which latents causally drive chat behavior. They find interpretable latents for concepts such as refusals, false information detection, and personal questions, and demonstrate that template tokens play a central role in driving chat-specific behavior. Overall, the work advances best practices for crosscoder-based model diffing and provides concrete insights into how chat-tuning alters model behavior, with implications for safety and alignment.

Abstract

Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues that stem from the crosscoder's L1 training loss and that can misattribute concepts as unique to the fine-tuned model when they in fact exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as $\textit{false information}$ and $\textit{personal question}$, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat-tuning modifies model behavior.
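
To make the setup concrete, the following is a minimal NumPy sketch of a crosscoder forward pass with a BatchTopK activation. It is an illustration under stated assumptions, not the authors' implementation: the parameter names (`W_enc_base`, `D_chat`, and so on), the shapes, and the choice to apply a ReLU and then keep the `k * batch_size` largest activations across the batch are all assumptions made for this example. It shows one shared latent vector per token being decoded separately for the base and chat models.

```python
import numpy as np


def batch_topk(latents: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * batch_size largest activations in the whole batch, zero the rest."""
    n_keep = k * latents.shape[0]
    flat = latents.ravel()
    if n_keep >= flat.size:
        return latents.copy()
    # Value of the n_keep-th largest entry; everything below it is zeroed.
    kth = flat.size - n_keep
    threshold = np.partition(flat, kth)[kth]
    return np.where(latents >= threshold, latents, 0.0)


def crosscoder_forward(h_base, h_chat, params, k):
    """Encode both models' activations into shared latents, decode per model.

    All parameter names are hypothetical and chosen for this sketch.
    """
    pre = (
        h_base @ params["W_enc_base"]
        + h_chat @ params["W_enc_chat"]
        + params["b_enc"]
    )
    latents = batch_topk(np.maximum(pre, 0.0), k)  # shared sparse code
    recon_base = latents @ params["D_base"] + params["b_dec_base"]
    recon_chat = latents @ params["D_chat"] + params["b_dec_chat"]
    return latents, recon_base, recon_chat


def init_params(d_model: int, n_latents: int, seed: int = 0) -> dict:
    """Random parameters, only so that the sketch can be run end to end."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc_base": rng.normal(size=(d_model, n_latents)),
        "W_enc_chat": rng.normal(size=(d_model, n_latents)),
        "b_enc": np.zeros(n_latents),
        "D_base": rng.normal(size=(n_latents, d_model)),
        "D_chat": rng.normal(size=(n_latents, d_model)),
        "b_dec_base": np.zeros(d_model),
        "b_dec_chat": np.zeros(d_model),
    }


# Usage: 4 token positions, toy activation width 8, 32 latents, k = 3 per token on average.
rng = np.random.default_rng(1)
params = init_params(d_model=8, n_latents=32)
latents, recon_base, recon_chat = crosscoder_forward(
    rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), params, k=3
)
```

Because the sparsity budget is enforced across the batch rather than through a per-latent L1 penalty, no decoder norm is pushed towards zero by the sparsity term itself, which is the intuition behind the mitigation described in the abstract.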

Paper Structure

This paper contains 55 sections, 34 equations, 49 figures, 6 tables.

Figures (49)

  • Figure 1: Histogram of decoder latent relative norm differences ($\Delta_\text{norm}$) between base and chat Gemma 2 2B models (Gemma Team, 2024), for both the L1 crosscoder (left) and the BatchTopK crosscoder (right). A value of $1$ means the decoder vector of a latent for the base model is zero, indicating the latent is not useful for the base model (chat-only latents). A value of $0$ means the chat model's decoder vector has a norm of zero (base-only latents). Values around $0.5$ indicate similar decoder norms in both models, suggesting equal utility in both models (shared latents). We also highlight the chat-specific latents: the chat-only latents that are affected by neither Complete Shrinkage (error ratio $\nu^\varepsilon < 0.2$) nor Latent Decoupling (reconstruction ratio $\nu^r < 0.5$). Most of the L1 crosscoder chat-only latents suffer from these issues.
  • Figure 2: We compare how chat-only latents are affected by the two issues, Complete Shrinkage and Latent Decoupling. Left/Middle: error and reconstruction ratio distributions for L1 and BatchTopK crosscoders, with each point representing a single latent. High reconstruction ratios ($y$-axis) that overlap with the shared distribution indicate Latent Decoupling (redundant encoding). High error ratios ($x$-axis) indicate Complete Shrinkage (useful base latents forced to zero norm). Low values on both metrics (bottom left) identify truly chat-specific latents. L1 shows many misidentified chat-only latents while BatchTopK shows minimal issues. This means that $\Delta_\text{norm}$ successfully identifies chat-specific latents for BatchTopK but fails for L1. Right: Count of latents below a range of $\nu$ thresholds ($x$-axis), comparing 3176 L1 chat-only latents versus top-3176 BatchTopK latents sorted by $\Delta_\text{norm}$.
  • Figure 3: Simplified illustration of our experimental setup for measuring latent causal importance. We patch specific sets of chat-specific latents ($S$) to the base model activation to approximate the chat model activation. The resulting approximation is then passed through the remaining layers of the chat model. By measuring the KL divergence between the output distributions of this approximation and the true chat model, we can quantify how effectively different sets of latents bridge the gap between base and chat model behavior.
  • Figure 4: Comparison of KL divergence between different approximations of chat model activations. Note the different $y$-axis scales: KL is generally much higher on the first 9 tokens. We establish baselines by replacing either None or All of the latents. We then evaluate the Latent Scaling metric against the relative norm difference ($\Delta_\text{norm}$) by comparing the effects of replacing the highest 50% (red) versus lowest 50% (green) of latents ranked by each metric. We show the 95% confidence intervals for all measurements. Our results reveal a critical difference between the crosscoders: while $\Delta_\text{norm}$ fails to identify causally important latents in the L1 crosscoder, where lower $\Delta_\text{norm}$ leads to smaller KL improvement, it successfully does so in the BatchTopK crosscoder. This confirms our hypothesis that $\Delta_\text{norm}$ is a meaningful metric in BatchTopK but merely a training artifact in L1. Using Latent Scaling, we successfully identify the most causal latents in L1, which is particularly evident in the first 9 tokens (right), where it almost matches BatchTopK. This shows that both crosscoders capture the behavioral difference similarly, but BatchTopK avoids the $\Delta_\text{norm}$ artifacts.
  • Figure 5: Autointerpretability detection scores (higher is better) across bins based on $\operatorname{rank}(\nu^\varepsilon) + \operatorname{rank}(\nu^r)$. Lower bins indicate lower $\nu$ values and more chat-specific latents. We compare the 3176 chat-only latents from the L1 crosscoder with the top-3176 latents by $\Delta_\text{norm}$ from the BatchTopK crosscoder.
  • ...and 44 more figures
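
The quantities in the Figure 1 caption can be illustrated with a short NumPy sketch. The formula used for $\Delta_\text{norm}$ below is an assumption chosen to reproduce the values the caption describes ($1$ when the base decoder vector is zero, $0$ when the chat decoder vector is zero, $0.5$ for equal norms). The function names, the `chat_only_min` cutoff of $0.9$, and the toy inputs are likewise hypothetical. The error and reconstruction ratios $\nu^\varepsilon$ and $\nu^r$ are taken as precomputed inputs, since their computation via Latent Scaling is not described here; only the thresholds $0.2$ and $0.5$ come from the caption.

```python
import numpy as np


def relative_norm_difference(d_base, d_chat, eps=1e-12):
    """Assumed form of Delta_norm per latent, in [0, 1].

    d_base, d_chat: arrays of shape (n_latents, d_model) holding each
    latent's decoder vector for the base and chat model.
    """
    n_base = np.linalg.norm(d_base, axis=-1)
    n_chat = np.linalg.norm(d_chat, axis=-1)
    denom = np.maximum(np.maximum(n_base, n_chat), eps)  # avoid division by zero
    return 0.5 * ((n_chat - n_base) / denom + 1.0)


def chat_specific_mask(delta_norm, nu_eps, nu_r,
                       chat_only_min=0.9,  # hypothetical cutoff for "chat-only"
                       max_error_ratio=0.2, max_recon_ratio=0.5):
    """Chat-only latents with low error ratio and low reconstruction ratio."""
    chat_only = delta_norm >= chat_only_min
    no_shrinkage = nu_eps < max_error_ratio    # not explained by Complete Shrinkage
    no_decoupling = nu_r < max_recon_ratio     # not explained by Latent Decoupling
    return chat_only & no_shrinkage & no_decoupling


# Toy decoders: latent 0 is chat-only, latent 1 is shared, latent 2 is base-only.
d_base = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
d_chat = np.array([[3.0, 4.0], [3.0, 4.0], [0.0, 0.0]])
delta = relative_norm_difference(d_base, d_chat)

# Made-up ratios, only to exercise the filter.
mask = chat_specific_mask(delta,
                          nu_eps=np.array([0.1, 0.1, 0.1]),
                          nu_r=np.array([0.3, 0.3, 0.3]))
```

Keeping the filter separate from the norm computation mirrors the point of Figures 1 and 2: a high $\Delta_\text{norm}$ alone marks a latent as chat-only, while the two ratios decide whether it is genuinely chat-specific.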