
Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes

Aly Kassem, Thomas Jiralerspong, Negar Rostamzadeh, Golnoosh Farnadi

TL;DR

Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines while matching non-SAE-based ones.

Abstract

Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines while matching non-SAE-based ones. Our results demonstrate that crosscoders remain a powerful tool for model diffing.
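
BatchTopK sparsity, as it is usually described in the sparse-autoencoder literature, keeps the k × batch_size largest latent activations across a whole batch rather than the top k per example. The following is a minimal, illustrative sketch in plain Python. The function name `batch_topk` and the toy activations are assumptions made here, not the authors' implementation, and the delta-based loss is not shown because the abstract does not specify its form.

```python
import heapq
from typing import List


def batch_topk(acts: List[List[float]], k: int) -> List[List[float]]:
    """Keep the k * batch_size largest activations across the whole batch.

    Hypothetical helper for illustration only; `acts` holds one row of
    latent pre-activations per example.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    batch_size = len(acts)
    # Apply ReLU and remember each value's (row, column) position.
    flat = [
        (max(value, 0.0), i, j)
        for i, row in enumerate(acts)
        for j, value in enumerate(row)
    ]
    # The sparsity budget is shared by the batch, not fixed per example.
    n_keep = min(k * batch_size, len(flat))
    top = heapq.nlargest(n_keep, flat, key=lambda t: t[0])
    # Everything outside the batch-level top set is zeroed.
    out = [[0.0] * len(row) for row in acts]
    for value, i, j in top:
        out[i][j] = value
    return out


toy = [[0.9, 0.1, 0.8], [0.2, -0.5, 0.3]]
sparse = batch_topk(toy, k=1)
print(sparse)  # [[0.9, 0.0, 0.8], [0.0, 0.0, 0.0]]
```

With k = 1 and two examples, the budget is two active latents in total. Both land in the first example here, which is the property that distinguishes a batch-level budget from per-example TopK.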

Paper Structure (41 sections, 8 equations, 7 figures, 2 tables)


Figures (7)

  • Figure 1: Latent steering effects across model organisms. Each row shows responses under negative steering (left), the unsteered aligned baseline (center), and positive steering (right). Rows correspond to EM (toxic, refusal), SDF (abortion, cake bake), and Taboo (gold).
  • Figure 2: The strongest latents for steering, with their top tokens from max-activated examples among three organisms.
  • Figure 3: The strongest latents for steering, with their top tokens from max-activated examples of Taboo Gold and SDF-Cake.
  • Figure 4: Effect of Delta-Crosscoder latent steering on misalignment. Positive steering increases the misalignment score of the aligned base model (top), whereas negative steering suppresses misalignment in the misaligned fine-tuned model, producing an average decrease (bottom). Empty bars indicate cases where the unsteered baseline response is already non-harmful, leaving no misalignment to reduce.
  • Figure 5: Coverage of organisms across SAE-based diffing methods, showing that Delta-Crosscoder identifies a broader set of organisms than the DSF and BatchTopK baselines.
  • ...and 2 more figures