From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

Huiyuan Tian; Bonan Xu; Shijian Li; Xin Jin

TL;DR

The paper shows that Vision Transformers (ViTs) have globally low-rank last-layer representations, yet each token spreads its energy across most channels, so a narrow student faces an encoding mismatch with a wide teacher during feature-map knowledge distillation (KD). A two-view analysis, combining layer-wise SVD with a token-level Spectral Energy Pattern (SEP), identifies this mismatch and motivates two minimal remedies: post-hoc feature lifting and native width alignment, which restore the effectiveness of simple feature-map KD on ImageNet-1K. A DeiT-Tiny student distilled from CaiT-S24 improves from $74.86\%$ to $77.53\%$ and $78.23\%$, and students trained without any teacher also improve. The work offers design guidance for compact ViTs and distillation strategies: token-level encoding capacity matters alongside global subspace structure.

Abstract

Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only $121/61/34/14$ dimensions suffice to capture $99\%/95\%/90\%/80\%$ of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student's last block to the teacher's width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from $74.86\%$ to $77.53\%$ and $78.23\%$ when distilling from CaiT-S24, while also improving standalone students trained without any teacher. Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.
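The energy-rank statistic quoted above (the number of dimensions needed to capture a given share of the energy) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that "energy" means the sum of squared singular values and that features are stacked as a (tokens × channels) matrix. The function name `rank_for_energy` and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def rank_for_energy(features, fractions=(0.99, 0.95, 0.90, 0.80)):
    """Return, for each fraction, the smallest number of singular
    directions whose cumulative energy reaches that fraction.

    features: array of shape (num_tokens, num_channels).
    Assumption: energy is the squared singular value.
    """
    # Singular values come back sorted in descending order.
    s = np.linalg.svd(features, compute_uv=False)
    energy = s ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    ranks = {}
    for f in fractions:
        # First index whose cumulative share is >= f, converted to a count.
        k = int(np.searchsorted(cumulative, f)) + 1
        ranks[f] = min(k, len(s))  # guard against floating-point round-off
    return ranks


if __name__ == "__main__":
    # Synthetic stand-in for a last-layer feature matrix: a rank-20 signal
    # plus small noise, with 384 channels (hypothetical sizes).
    rng = np.random.default_rng(0)
    signal = rng.normal(size=(1000, 20)) @ rng.normal(size=(20, 384))
    noise = 0.01 * rng.normal(size=(1000, 384))
    print(rank_for_energy(signal + noise))
```

On a real model, `features` would be the last-layer token embeddings collected over a set of images. A small returned rank relative to the channel width is what the abstract calls a globally low-rank representation. It does not by itself say how individual tokens use the channels, which is the question the SEP analysis addresses.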

Figures (8)