SpectralKD: A Unified Framework for Interpreting and Distilling Vision Transformers via Spectral Analysis

Huiyuan Tian, Bonan Xu, Shijian Li, Gang Pan

TL;DR

SpectralKD introduces a unified spectral-analysis framework to interpret Vision Transformers (ViTs) and enhance knowledge distillation (KD). By defining model-wise and layer-wise spectral measures, it uncovers a U-shaped information pattern in CaiT and cross-architecture spectral similarities between CaiT and Swin, guiding layer selection and feature alignment. A simple, parameter-free spectral alignment KD is proposed, using FFT-based frequency matching and adaptive channel alignment, achieving state-of-the-art ImageNet-1K performance without extra trainable parameters and demonstrating robust gains across DeiT and Swin models. The work further reveals distillation dynamics, showing that distilled students adopt teacher-like spectral patterns even in non-directly aligned layers, highlighting a new view on how KD shapes internal representations and generalization.

Abstract

Knowledge Distillation (KD) has achieved widespread success in compressing large Vision Transformers (ViTs), but a unified theoretical framework for both ViTs and KD is still lacking. In this paper, we propose SpectralKD, a novel unified analytical framework that offers deeper insights into ViTs and optimizes KD via spectral analysis. Our model-wise analysis reveals that CaiT concentrates information in its first and last few layers, informing optimal layer selection for KD. Surprisingly, our layer-wise analysis discovers that Swin Transformer and CaiT exhibit similar spectral encoding patterns despite their architectural differences, leading to a feature map alignment guideline. Building on these insights, we propose a simple yet effective spectral alignment method for KD. Benefiting from the deeper understanding provided by the above analysis, even such a simple strategy achieves state-of-the-art performance on ImageNet-1K without introducing any trainable parameters, improving DeiT-Tiny by $+5.2\%$ and Swin-Tiny by $+1.4\%$ in top-1 accuracy. Furthermore, our post-training analysis reveals that distilled students can reproduce spectral patterns similar to their teachers, opening a new area we term "distillation dynamics". Code and experimental logs are available at https://github.com/thy960112/SpectralKD.
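To make the idea of parameter-free spectral alignment concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes feature maps of shape (B, C, H, W) with matching spatial sizes, assumes the teacher's channels are average-pooled down to the student's width, and assumes the loss is a mean squared error over the real and imaginary parts of a 2D FFT. The function names `adaptive_avg_pool_channels` and `spectral_alignment_loss` are hypothetical.

```python
import numpy as np

def adaptive_avg_pool_channels(x, out_channels):
    """Average-pool the channel axis of x (B, C, H, W) down to out_channels.

    Hypothetical helper: contains no trainable parameters.
    """
    b, c, h, w = x.shape
    pooled = np.empty((b, out_channels, h, w), dtype=x.dtype)
    for i in range(out_channels):
        start = (i * c) // out_channels
        end = -((-(i + 1) * c) // out_channels)  # ceiling division
        pooled[:, i] = x[:, start:end].mean(axis=1)
    return pooled

def spectral_alignment_loss(student, teacher):
    """Assumed loss: MSE between the 2D spectra of student and teacher features."""
    if teacher.shape[1] != student.shape[1]:
        # Assumption: the wider teacher is pooled to the student's channel count.
        teacher = adaptive_avg_pool_channels(teacher, student.shape[1])
    fs = np.fft.fft2(student, axes=(-2, -1))
    ft = np.fft.fft2(teacher, axes=(-2, -1))
    # Compare real and imaginary parts so the loss stays real-valued.
    s_parts = np.stack([fs.real, fs.imag], axis=-1)
    t_parts = np.stack([ft.real, ft.imag], axis=-1)
    return float(np.mean((s_parts - t_parts) ** 2))

rng = np.random.default_rng(0)
student_feat = rng.standard_normal((2, 4, 8, 8))
teacher_feat = rng.standard_normal((2, 8, 8, 8))
loss = spectral_alignment_loss(student_feat, teacher_feat)
```

Under these assumptions the loss is zero when the pooled teacher features equal the student features, and it adds no learnable projection between the two networks.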

Paper Structure

This paper contains 22 sections, 11 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Model-wise frequency intensity analysis $L(\mathbf{X})$ (Equation \ref{model_intensity}), plotted across the Transformer depth for three models: (a) CaiT-S24 (teacher), (b) DeiT-Tiny without distillation, and (c) DeiT-Tiny distilled by our SpectralKD. SpectralKD clearly shifts the student's per-layer intensities closer to those of the teacher, especially in the most information-rich early and final layers. Both the CaiT-S24 and baseline DeiT-Tiny checkpoints are taken from the timm library \cite{rw2019timm}.
  • Figure 2: Histogram of the model-wise frequency intensities $L(\mathbf{X})$ (Equation \ref{model_intensity}) for all $24$ layers of CaiT-S24. The distribution is heavily skewed: most layers have relatively low spectral intensity, while only a small number exhibit substantially higher values. These peaks point to potentially critical layers for knowledge distillation.
  • Figure 3: Layer-wise spectral intensity distributions $\mathbf{S}(\mathbf{X})$ (Equation \ref{channel_intensity}) for representative layers of CaiT-S24. The visualization reveals distinct encoding patterns across network depths. Early layers (a-b) exhibit approximately uniform intensities across frequencies. Middle layers (c-d) show a marked decay from low to high frequency. Final layers (e-f) once again become relatively uniform but at distinctly higher overall intensities.
  • Figure 4: Stage-wise spectral intensity distributions $\mathbf{S}(\mathbf{X})$ (Equation \ref{channel_intensity}) for Swin-Small. Despite the architectural contrasts with CaiT, the same trend emerges across its four stages. Early stages (a-b) show a decay from low to high frequency. Later stages (c-d) exhibit increasingly uniform and higher-intensity representations. This similarity suggests a shared spectral encoding strategy in deep Transformer models. Note: Each stage of Swin Transformer comprises multiple attention blocks, mapping to several layers in a uniform ViT model like CaiT.
  • Figure 5: Layer-wise spectral intensity distributions $\mathbf{S}(\mathbf{X})$ (Equation \ref{channel_intensity}) for three representative layers $(1, 6, 12)$ in the non-distilled baseline (top row, a–c) and the SpectralKD-distilled student (bottom row, d–f). Notably, the distilled student (DeiT-Tiny) exhibits teacher-like patterns (see Figure \ref{fig:spectral_distribution}) even in Layer $6$, which was not explicitly aligned during distillation. This underscores how selective alignment of early and final layers can induce broader teacher-like behavior in the student network.
  • ...and 4 more figures
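
The captions above refer to a layer-wise intensity $\mathbf{S}(\mathbf{X})$ and a model-wise intensity $L(\mathbf{X})$, but the defining equations are not reproduced in this section. The sketch below is therefore only one plausible reading and should not be taken as the paper's definition. It assumes a feature map of shape (B, C, H, W), an FFT along the channel axis, an amplitude averaged over batch and spatial positions to give $\mathbf{S}(\mathbf{X})$, and the mean of that vector to give $L(\mathbf{X})$. Both function names are hypothetical.

```python
import numpy as np

def channel_spectral_intensity(x):
    """Assumed S(X): mean FFT amplitude per frequency bin along the channel axis.

    x has shape (B, C, H, W); the result has shape (C,).
    """
    spectrum = np.fft.fft(x, axis=1)       # 1D FFT over channels (assumption)
    amplitude = np.abs(spectrum)           # magnitude of each frequency component
    return amplitude.mean(axis=(0, 2, 3))  # average over batch and spatial dims

def model_spectral_intensity(x):
    """Assumed L(X): a single scalar per layer, the mean of S(X)."""
    return float(channel_spectral_intensity(x).mean())

# A constant feature map puts all of its energy in the zero-frequency bin.
constant_map = np.ones((1, 4, 2, 2))
s_const = channel_spectral_intensity(constant_map)
l_const = model_spectral_intensity(constant_map)

# Random features spread energy across all bins.
rng = np.random.default_rng(0)
s_rand = channel_spectral_intensity(rng.standard_normal((2, 8, 4, 4)))
```

With such a measure, plotting the scalar for every layer would give a depth profile of the kind shown in Figure 1, and plotting the vector for one layer would give a per-frequency distribution of the kind shown in Figures 3 to 5.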