Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs

Thomas Jiralerspong, Trenton Bricken

TL;DR

This work extends model diffing to cross-architecture comparisons by introducing Dedicated Feature Crosscoders that partition features into A-exclusive, B-exclusive, and shared sets, thus isolating model-exclusive representations. The approach is validated in toy and real-model diffs (notably Llama-Qwen and GPT-OSS-DeepSeek), showing that DFCs recover more exclusive features while maintaining core reconstruction metrics. The authors demonstrate unsupervised discovery of meaningful behavioral differences such as CCP alignment, American exceptionalism, and a copyright refusal mechanism, and they propose a screen-and-verify workflow to validate findings. The results suggest cross-architecture model diffing can surface unknown unknowns and safety-relevant divergences that complement existing red-teaming and evaluation methods, with limitations including robustness across seeds and the need for broader generalization.

Abstract

Model diffing, the process of comparing models' internal representations to identify their differences, is a promising approach for uncovering safety-critical behaviors in new models. However, its application has so far been primarily focused on comparing a base model with its finetune. Since new LLM releases are often novel architectures, cross-architecture methods are essential to make model diffing widely applicable. Crosscoders are one solution capable of cross-architecture model diffing but have only ever been applied to base vs finetune comparisons. We provide the first application of crosscoders to cross-architecture model diffing and introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this technique, we find in an unsupervised fashion features including Chinese Communist Party alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B. Together, our results work towards establishing cross-architecture crosscoder model diffing as an effective method for identifying meaningful behavioral differences between AI models.

Paper Structure (164 sections, 16 equations, 20 figures, 18 tables, 2 algorithms)

Figures (20)

  • Figure 1: Representative model-exclusive features for the 5% model-exclusive DFC (Dedicated Feature Crosscoder): Our DFC between Qwen and Llama finds a number of meaningful features exclusive to each model: A Qwen-exclusive "CCP Alignment" feature controls censorship and alignment with CCP narratives (left), while a Llama-exclusive "American Exceptionalism" feature controls alignment with American exceptionalism narratives (right). Negative steering on the "American Exceptionalism" feature does not produce interpretable behavior so is not shown. Text highlighting in the steering examples was done manually.
  • Figure 2: Architectural comparison of standard crosscoder and Dedicated Feature Crosscoder (DFC). In a DFC, the feature dictionary is partitioned by design into three disjoint sets: features exclusive to Model A, features exclusive to Model B, and shared features. Each model's activations can only be encoded to and decoded from its dedicated features and the shared set, enforcing architectural exclusivity by design.
  • Figure 3: Cross-architecture transfer of persona steering vectors via DFC alignment. The sycophantic vector discovered in Llama is transferred to Qwen. Both models exhibit remarkably similar sycophantic behaviors when steered, confirming the DFC's ability to learn a meaningfully aligned representation space across architectures. Text in the steered replies was bolded manually.
  • Figure 4: On a synthetic toy model with ground-truth concepts, DFCs outperform standard crosscoders and Designated Shared Feature (DSF) crosscoders in identifying model-exclusive features, at the cost of more false positives. Results are averaged over 5 random seeds, with the shaded area and error bars representing the standard error. Left: DFCs (blue) achieve a higher model-exclusive concept recovery rate (recall) than standard crosscoders (orange) and DSF crosscoders (red), especially at lower dictionary sizes, which is most likely the regime in which real-world applications operate. Right: This improved recall usually comes at the cost of an increased false positive rate, which we argue is a favorable trade-off for safety auditing, where maximizing recall is a priority. These false positives consist of shared concepts incorrectly identified as exclusive to only one model (darker colors) or features that recover no concept at all (lighter colors). The latter category might be easily detectable in real models through interpretability metrics like the detection score (see the evaluation metrics appendix).
  • Figure 5: DFCs identify more highly exclusive features than standard crosscoders in real models. Left: In a Llama-Qwen diff, the DFC's dedicated partitions yield a feature distribution heavily skewed towards the maximum exclusivity score of 5 (blue). In contrast, the standard crosscoder's exclusive features, which are approximated by taking the 500 features with the most extreme relative decoder norms (orange), are less selective: the density of their distribution is lower than the DFC's at score 5 and correspondingly higher near score 1. This distributional shift provides some evidence that the DFC is better at identifying highly exclusive features. Right: The distributions for shared features are nearly identical, showing that this distributional shift only applies to model-exclusive features.
  • ...and 15 more figures
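
The partition described in the Figure 2 caption can be sketched with per-model masks over the feature dictionary. This is a minimal illustration, not the authors' implementation: the function names, the layout of the partition (A-exclusive features first, then B-exclusive, then shared), and the reading of the exclusive fraction as a per-model share are all assumptions made here, and biases, sparsity constraints, and training are omitted.

```python
import random

def partition_masks(n_features, frac_exclusive):
    """Build 0/1 masks saying which features each model may use.

    Assumed layout (hypothetical): [A-exclusive | B-exclusive | shared],
    with frac_exclusive read as the share of the dictionary given to EACH model.
    """
    n_excl = int(n_features * frac_exclusive)
    # Model A is blocked from the B-exclusive slice, and vice versa.
    mask_a = [0.0 if n_excl <= i < 2 * n_excl else 1.0 for i in range(n_features)]
    mask_b = [0.0 if i < n_excl else 1.0 for i in range(n_features)]
    return mask_a, mask_b

def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def encode(x_a, x_b, enc_a, enc_b, mask_a, mask_b):
    """Sum each model's masked encoder contribution, then apply ReLU.

    A feature exclusive to one model receives no input from the other model.
    """
    pre_a = matvec(enc_a, x_a)
    pre_b = matvec(enc_b, x_b)
    return [max(0.0, ma * pa + mb * pb)
            for pa, pb, ma, mb in zip(pre_a, pre_b, mask_a, mask_b)]

def decode(features, dec, mask):
    """Reconstruct one model's activations from the features it is allowed to see."""
    visible = [f * m for f, m in zip(features, mask)]
    return matvec(dec, visible)

# Tiny demo with random weights (dimensions are arbitrary illustration values).
rng = random.Random(0)
d_a, d_b, n_features = 3, 4, 10
mask_a, mask_b = partition_masks(n_features, 0.2)
enc_a = [[rng.uniform(-1, 1) for _ in range(d_a)] for _ in range(n_features)]
enc_b = [[rng.uniform(-1, 1) for _ in range(d_b)] for _ in range(n_features)]
dec_a = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(d_a)]

feats = encode([1.0, 2.0, 3.0], [0.5, -1.0, 0.0, 2.0], enc_a, enc_b, mask_a, mask_b)
recon_a = decode(feats, dec_a, mask_a)
```

Masking both the encoder and the decoder is what makes the exclusivity structural in this sketch: changing Model B's activations cannot move an A-exclusive feature, and a B-exclusive feature cannot contribute to Model A's reconstruction.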