Transcoder Adapters for Reasoning-Model Diffing

Nathan Hu; Jake Ward; Thomas Icard; Christopher Potts

TL;DR

We provide insight into reasoning training and introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning.

Abstract

While reasoning models are increasingly common, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We study one such behavior in depth: the production of hesitation tokens (e.g., "wait"). Using attribution graphs, we trace hesitation to only ~2.4% of adapter features (5.6k total), each performing one of two functions. These features are necessary and sufficient for producing hesitation tokens; removing them reduces response length, often without affecting accuracy. Overall, our results provide insight into reasoning training and suggest transcoder adapters may be useful for studying fine-tuning more broadly.
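
The idea in the abstract can be illustrated with a toy sketch. It assumes, purely for illustration, that an adapter is a sparse ReLU encoder/decoder whose output is added to the frozen base MLP output, so that the sum approximates the fine-tuned MLP. The names `base_mlp`, `adapter_features`, `adapted_mlp`, `W_enc`, `b_enc`, and `W_dec` are hypothetical and are not taken from the paper, and the actual architecture and training objective may differ.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def base_mlp(x, W_in, W_out):
    """Toy stand-in for the frozen base-model MLP (ReLU, no gating)."""
    hidden = [max(0.0, v) for v in matvec(W_in, x)]
    return matvec(W_out, hidden)


def adapter_features(x, W_enc, b_enc):
    """Hypothetical adapter encoder: ReLU features, most of which are zero."""
    pre = matvec(W_enc, x)
    return [max(0.0, v + b) for v, b in zip(pre, b_enc)]


def adapted_mlp(x, W_in, W_out, W_enc, b_enc, W_dec):
    """Base MLP output plus the adapter's sparse estimate of the difference."""
    feats = adapter_features(x, W_enc, b_enc)
    delta = matvec(W_dec, feats)  # assumed: approximates target_mlp(x) - base_mlp(x)
    out = [b + d for b, d in zip(base_mlp(x, W_in, W_out), delta)]
    return out, feats


# Tiny example: 2-dimensional residual stream, 3 adapter features.
x = [1.0, -2.0]
W_in = [[1.0, 0.0], [0.0, 1.0]]
W_out = [[1.0, 0.0], [0.0, 1.0]]
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -0.5]
W_dec = [[0.5, 0.0, 0.0], [0.0, 1.0, 2.0]]

out, feats = adapted_mlp(x, W_in, W_out, W_enc, b_enc, W_dec)
l0 = sum(1 for f in feats if f > 0.0)  # number of active adapter features
```

In this toy setting only one of the three features fires, which is the sense in which the features are "sparsely activating": the count of active features per token (the $L_0$) stays small, so each contribution to the difference can be inspected individually.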

Paper Structure (31 sections, 7 equations, 20 figures, 3 tables)

Introduction
Related Work
Training Transcoder Adapters
Architecture
Training Objective
Experimental Setup
Evaluating Transcoder Adapters
Output Faithfulness
Internal Faithfulness
Benchmark Evaluation
Interpreting Transcoder Adapters
Automated Evaluations of Feature Interpretability
Transcoder Adapter Feature Classes
Attribution Graphs
Manipulating Transcoder Adapters
...and 16 more sections

Figures (20)

Figure 1: Transcoder adapters for model diffing. Transcoder adapters learn a sparse approximation of the difference in MLP computation before and after fine-tuning. Adapters are trained to reconstruct both internal activations and final output. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B.
Figure 2: Output faithfulness. Top-1 error and KL divergence against the target model. Transcoder adapters outperform baselines and approach the MLP fine-tuning skyline, achieving strong reconstruction even when very sparse ($L_0$ of 0.1–10).
Figure 3: Internal faithfulness. We evaluate internal faithfulness via (1) NMSE of hidden states against the target model and (2) KL divergence when replacing various subsets of target layers with adapter layers. Transcoder adapters outperform baselines across all metrics. Notably, KL divergence when replacing subsets of layers is always less than KL divergence when using adapters to approximate all layers, indicating internal errors do not accumulate beyond what is reflected in the final output.
Figure 4: Benchmark evaluation. Transcoder adapters match the target model's response lengths and recover much of the accuracy gains from reasoning fine-tuning. The remaining gap is comparable to the MLP fine-tuning skyline, suggesting it reflects training data differences rather than limitations of transcoder adapters. Despite sharing all non-MLP parameters with the target reasoning model, the hybrid baseline exhibits similar response length and accuracy to the base model.
Figure 5: Automated interpretability scores of transcoder adapter features and MLP neurons. Max-activating examples of adapter features achieve slightly higher detection accuracy than neurons, while uniformly sampled activating examples score below neurons but well above random chance (0.50).
...and 15 more figures
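
The faithfulness metrics named in the captions for Figures 2 and 3 (KL divergence, top-1 error, NMSE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the KL direction (target relative to adapter), the NMSE normalization (squared error divided by the squared norm of the target), and the function names are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(target_logits, approx_logits):
    """KL(target || approx) between next-token distributions (assumed direction)."""
    p = softmax(target_logits)
    q = softmax(approx_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def top1_error(target_batch, approx_batch):
    """Fraction of positions where the two models' argmax tokens disagree."""
    def argmax(v):
        return max(range(len(v)), key=lambda i: v[i])
    wrong = sum(argmax(t) != argmax(a) for t, a in zip(target_batch, approx_batch))
    return wrong / len(target_batch)


def nmse(target, approx):
    """Squared error normalized by the target's squared norm (assumed definition)."""
    num = sum((a - t) ** 2 for t, a in zip(target, approx))
    den = sum(t ** 2 for t in target)
    return num / den


# Example: two token positions, vocabulary of size 2.
target_logits = [[1.0, 2.0], [3.0, 0.0]]
approx_logits = [[0.0, 5.0], [0.0, 3.0]]
err = top1_error(target_logits, approx_logits)          # positions disagree on the second token only
kl = kl_divergence(target_logits[0], approx_logits[0])  # positive when distributions differ
```

Lower values on all three metrics indicate an adapter that more closely reproduces the target model; identical logits or hidden states give exactly zero.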
