Towards Understanding Distilled Reasoning Models: A Representational Approach
David D. Baek, Max Tegmark
TL;DR
This work investigates how distillation reshapes reasoning features in large language models by training a sparse crosscoder on Qwen-series models and their fine-tuned variants. It identifies reasoning feature directions unique to distilled models, including self-reflection and computation verification, and probes their causal role through ablation and steering experiments. The results also suggest that larger distilled models develop more structured representations, as measured by parallelogram-geometry metrics, linking model size, distillation, and representation organization. By showing how distillation alters internal reasoning features and their geometry, the findings contribute to the transparency and reliability of AI systems.
Abstract
In this paper, we investigate how model distillation impacts the development of reasoning features in large language models (LLMs). To explore this, we train a crosscoder on Qwen-series models and their fine-tuned variants. Our results suggest that the crosscoder learns features corresponding to various types of reasoning, including self-reflection and computation verification. Moreover, we observe that distilled models contain unique reasoning feature directions, which can be used to steer the model into an over-thinking or an incisive-thinking mode. In particular, we analyze four reasoning categories: (a) self-reflection, (b) deductive reasoning, (c) alternative reasoning, and (d) contrastive reasoning. Finally, we examine the changes in feature geometry resulting from the distillation process and find indications that larger distilled models may develop more structured representations, which correlate with enhanced distillation performance. By providing insights into how distillation modifies the model, our study contributes to enhancing the transparency and reliability of AI systems.
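To make the terms in the abstract concrete, the following is a minimal sketch of a two-model crosscoder forward pass, a per-feature measure of how model-specific a feature is, and steering along a feature direction. It assumes a commonly used crosscoder formulation (one shared sparse latent vector, with separate encoder and decoder weights per model); the abstract does not specify the architecture, layers, or hyperparameters used in this work, so every name, shape, and value below is hypothetical and chosen only for illustration.

```python
import numpy as np

def crosscoder_forward(acts, W_enc, b_enc, W_dec, b_dec):
    """Assumed crosscoder forward pass (illustrative, not the paper's code).

    acts:  (n_models, d_model)          activations from each model
    W_enc: (n_models, d_model, n_feat)  per-model encoder weights
    b_enc: (n_feat,)                    shared encoder bias
    W_dec: (n_models, n_feat, d_model)  per-model decoder weights
    b_dec: (n_models, d_model)          per-model decoder bias
    """
    # Sum the per-model encodings into one shared pre-activation, then ReLU.
    pre = np.einsum("md,mdf->f", acts, W_enc) + b_enc
    feats = np.maximum(pre, 0.0)
    # Reconstruct each model's activation from the shared features.
    recon = np.einsum("f,mfd->md", feats, W_dec) + b_dec
    return feats, recon

def relative_decoder_norm(W_dec, model_idx):
    """Share of each feature's decoder norm that belongs to one model.

    Values near 1 mark features that are (under this assumed criterion)
    nearly exclusive to the model at model_idx.
    """
    norms = np.linalg.norm(W_dec, axis=-1)  # (n_models, n_feat)
    return norms[model_idx] / (norms.sum(axis=0) + 1e-12)

def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along a unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Toy example with random weights: model 0 = base, model 1 = distilled.
rng = np.random.default_rng(0)
n_models, d_model, n_feat = 2, 8, 16
acts = rng.normal(size=(n_models, d_model))
W_enc = rng.normal(size=(n_models, d_model, n_feat))
b_enc = np.zeros(n_feat)
W_dec = rng.normal(size=(n_models, n_feat, d_model))
b_dec = np.zeros((n_models, d_model))

feats, recon = crosscoder_forward(acts, W_enc, b_enc, W_dec, b_dec)
rel = relative_decoder_norm(W_dec, model_idx=1)
# Steer the distilled model's activation along its most model-specific feature.
top_feature = int(np.argmax(rel))
steered = steer(acts[1], W_dec[1, top_feature], alpha=4.0)
```

In this sketch, a positive `alpha` pushes the activation toward the chosen feature and a negative `alpha` pushes it away, which is one simple way to picture steering between over-thinking and incisive-thinking behavior. In practice the steered vector would be written back into the model's residual stream at a chosen layer, a step omitted here.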
