Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers

Huiyuan Tian, Bonan Xu, Shijian Li

TL;DR

Vision Transformers exhibit a two-phase information processing dynamic, characterized by an initial compression followed by task-specific expansion, which creates a fundamental representational mismatch with smaller students. The distillation dynamics framework combines frequency, entropy, and activation analyses to explain why naive feature mimicry often harms ViT performance and to guide principled compression strategies. Empirical results show that late-layer feature distillation yields negative transfer, while early-layer or spectral-domain alignment can offer marginal gains, highlighting the need for phase-aware distillation and representation translation. This work provides theoretical guidance for designing ViT compression methods and clarifies the conditions under which feature-based KD is beneficial or detrimental in transformer-based vision models.

Abstract

While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed "distillation dynamics", which combines frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViT compression strategies. All source code and experimental logs are provided at https://github.com/thy960112/Distillation-Dynamics.
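To make the contrast between logit-based distillation and naive feature mimicry concrete, the following is a minimal sketch of the two loss terms in plain Python. It is not the authors' implementation: the function names, the temperature value, and the linear projector `weight` are illustrative assumptions, and the logit loss follows the standard temperature-softened KL formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The temperature default is an assumption, not a value from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def project(student_feature, weight):
    """Hypothetical linear projector: weight has shape teacher_dim x student_dim."""
    return [sum(w * x for w, x in zip(row, student_feature)) for row in weight]


def feature_mimicry_loss(student_feature, teacher_feature, weight):
    """Mean squared error between the projected student feature and the teacher feature."""
    projected = project(student_feature, weight)
    squared = [(a - b) ** 2 for a, b in zip(projected, teacher_feature)]
    return sum(squared) / len(teacher_feature)
```

The projector is needed because the student has fewer channels than the teacher. The feature loss forces the student to reproduce the teacher's representation through that narrow map, whereas the logit loss only constrains the output distribution.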

Paper Structure

This paper contains 31 sections, 12 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Layer-wise Shannon entropy of the CaiT-S24 teacher model exhibits a characteristic U-shaped profile. Entropy decreases from layers 1-9 (compression phase), then increases through layer 24 (expansion phase).
  • Figure 2: U-shaped profile of average activation magnitude through network layers.
  • Figure 3: Frequency spectra of channel-wise feature representations across layers in CaiT-S24. Early layers (a-b) show uniform, noisy spectra; middle layers (c-d) exhibit low-pass filter characteristics corresponding to the representational bottleneck; late layers (e-f) return to uniform spectra with higher energy, indicating distributed high-dimensional encoding. This three-phase evolution aligns with the U-shaped information processing profile in ViTs.
  • Figure 4: Distillation Evolution.
  • Figure 5: Layer-wise Shannon entropy for a standard ViT trained with supervised learning. The model displays the characteristic U-shaped profile, indicating an initial compression phase followed by an expansion phase.
  • ...and 2 more figures
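
The three diagnostics named in the captions above (layer-wise Shannon entropy, average activation magnitude, and channel-wise frequency spectra) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the histogram-based entropy estimator, the bin count, and the choice to take the transform along the channel axis of a single token vector are all assumptions, and every function name is hypothetical.

```python
import cmath
import math
from collections import Counter


def shannon_entropy(values, num_bins=16):
    """Entropy in bits of a histogram over `values` (assumed estimator)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant signal carries no information
    width = (hi - lo) / num_bins
    # Clamp the maximum value into the last bin.
    bins = Counter(min(int((v - lo) / width), num_bins - 1) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in bins.values())


def mean_abs_activation(values):
    """Average absolute activation, used here as the magnitude measure."""
    return sum(abs(v) for v in values) / len(values)


def power_spectrum(channel_vector):
    """Naive O(n^2) DFT power spectrum along the channel axis."""
    n = len(channel_vector)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(channel_vector)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def layer_profile(layer_features):
    """layer_features: one entry per layer, each a list of token vectors."""
    profile = []
    for tokens in layer_features:
        flat = [v for token in tokens for v in token]
        profile.append(
            {
                "entropy": shannon_entropy(flat),
                "magnitude": mean_abs_activation(flat),
            }
        )
    return profile
```

Plotting the `entropy` and `magnitude` entries of `layer_profile` against layer index is one way to look for the U-shaped profile the captions describe. A spectrum with energy concentrated in the first few bins of `power_spectrum` would correspond to the low-pass behavior attributed to the middle layers.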