SpectralKD: A Unified Framework for Interpreting and Distilling Vision Transformers via Spectral Analysis
Huiyuan Tian, Bonan Xu, Shijian Li, Gang Pan
TL;DR
SpectralKD introduces a unified spectral-analysis framework to interpret Vision Transformers (ViTs) and enhance knowledge distillation (KD). By defining model-wise and layer-wise spectral measures, it uncovers a U-shaped information pattern in CaiT and cross-architecture spectral similarities between CaiT and Swin, guiding layer selection and feature alignment. A simple, parameter-free spectral alignment KD is proposed, using FFT-based frequency matching and adaptive channel alignment, achieving state-of-the-art ImageNet-1K performance without extra trainable parameters and demonstrating robust gains across DeiT and Swin models. The work further reveals distillation dynamics, showing that distilled students adopt teacher-like spectral patterns even in non-directly aligned layers, highlighting a new view on how KD shapes internal representations and generalization.
Abstract
Knowledge Distillation (KD) has achieved widespread success in compressing large Vision Transformers (ViTs), but a unified theoretical framework for both ViTs and KD is still lacking. In this paper, we propose SpectralKD, a novel unified analytical framework that offers deeper insights into ViTs and optimizes KD via spectral analysis. Our model-wise analysis reveals that CaiT concentrates information in its first and last few layers, informing optimal layer selection for KD. Surprisingly, our layer-wise analysis discovers that Swin Transformer and CaiT exhibit similar spectral encoding patterns despite their architectural differences, leading to a feature map alignment guideline. Building on these insights, we propose a simple yet effective spectral alignment method for KD. Guided by these analysis results, even such a simple strategy achieves state-of-the-art performance on ImageNet-1K without introducing any trainable parameters, improving DeiT-Tiny by $+5.2\%$ and Swin-Tiny by $+1.4\%$ in top-1 accuracy. Furthermore, our post-training analysis reveals that distilled students can reproduce spectral patterns similar to their teachers, opening a new area we term ``distillation dynamics''. Code and experimental logs are available at https://github.com/thy960112/SpectralKD.
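To make the idea of a parameter-free spectral alignment loss more concrete, the following is a minimal NumPy sketch of one plausible reading of "FFT-based frequency matching and adaptive channel alignment". It is not the authors' implementation (see the linked repository for that). The function names `pool_channels` and `spectral_alignment_loss`, the choice of average pooling over the channel axis, the assumption that student and teacher feature maps share the same spatial size, and the use of a plain mean squared difference between complex spectra are all assumptions made for illustration.

```python
import numpy as np


def pool_channels(x, out_channels):
    """Average-pool the channel axis of x, shape (C, H, W), into out_channels bins.

    Hypothetical helper: bins follow the usual adaptive-average-pooling rule,
    where bin i covers channels floor(i*C/out) up to ceil((i+1)*C/out).
    It has no trainable parameters.
    """
    c = x.shape[0]
    idx = np.arange(out_channels)
    starts = (idx * c) // out_channels
    ends = -((-(idx + 1) * c) // out_channels)  # ceiling division
    return np.stack([x[s:e].mean(axis=0) for s, e in zip(starts, ends)])


def spectral_alignment_loss(student, teacher):
    """Mean squared difference between 2D spectra of two feature maps.

    Assumes both inputs have shape (C, H, W) with identical H and W;
    channel counts may differ and are pooled to the smaller of the two.
    """
    c = min(student.shape[0], teacher.shape[0])
    s = pool_channels(student, c)
    t = pool_channels(teacher, c)
    # 2D FFT over the spatial axes of every channel.
    fs = np.fft.fft2(s, axes=(-2, -1))
    ft = np.fft.fft2(t, axes=(-2, -1))
    # |diff|^2 sums the squared real and imaginary parts.
    return float(np.mean(np.abs(fs - ft) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_feat = rng.standard_normal((8, 4, 4))  # toy teacher feature map
    student_feat = rng.standard_normal((4, 4, 4))  # toy student feature map
    print(spectral_alignment_loss(student_feat, teacher_feat))
```

In a training loop, a term of this kind would typically be added to the usual task loss for the layers selected by the spectral analysis; the weighting between the two terms is not specified here and would need to be taken from the paper or the repository.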
