From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

Huiyuan Tian, Bonan Xu, Shijian Li, Xin Jin

TL;DR

The paper shows that Vision Transformers exhibit globally low-rank last-layer representations yet rely on high per-token encoding bandwidth, causing an encoding mismatch with narrow students during feature-map KD. Through a two-view analysis (layer-wise SVD and a token-level Spectral Energy Pattern, SEP), the authors identify the mismatch and propose two minimal remedies: post-hoc feature lifting and native width alignment, which restore the effectiveness of simple feature-map KD on ImageNet-1K. With CaiT-S24 as the teacher, the two remedies raise DeiT-Tiny accuracy from $74.86\%$ to $77.53\%$ and $78.23\%$, and they also improve standalone students trained without a teacher. The work gives concrete guidance for compact ViT design and distillation, highlighting that token-level encoding capacity matters alongside global subspace structure.

Abstract

Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only $121/61/34/14$ dimensions suffice to capture $99\%/95\%/90\%/80\%$ of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student's last block to the teacher's width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from $74.86\%$ to $77.53\%$ and $78.23\%$ when distilling from CaiT-S24, while also improving standalone students trained without any teacher. Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.
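
The energy-based rank criterion in the abstract (how many dimensions capture $99\%/95\%/90\%/80\%$ of the energy) can be illustrated with a short NumPy sketch. This is not the authors' code: it assumes that "energy" means the sum of squared singular values of an uncentered token-by-channel feature matrix, and the function name `effective_rank` and the shape (197 tokens, 384 channels) are illustrative choices only.

```python
import numpy as np


def effective_rank(features: np.ndarray, energy: float = 0.99) -> int:
    """Smallest k whose top-k singular values hold `energy` of the total.

    Assumption: energy = sum of squared singular values (squared Frobenius
    norm) of an uncentered (tokens x channels) feature matrix.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    power = singular_values ** 2
    cumulative = np.cumsum(power) / power.sum()
    # First index whose cumulative share reaches the threshold, as a count.
    k = int(np.searchsorted(cumulative, energy)) + 1
    # Guard against round-off when energy is (almost) 1.0.
    return min(k, len(singular_values))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for last-layer features: 197 tokens, 384 channels,
    # built to lie in a 3-dimensional subspace.
    low_rank = rng.standard_normal((197, 3)) @ rng.standard_normal((3, 384))
    for threshold in (0.80, 0.90, 0.95, 0.99):
        print(threshold, effective_rank(low_rank, threshold))
```

On real features one would replace the synthetic matrix with the last-layer activations of a single image and aggregate the resulting ranks over many images, as the reported statistics do.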

Paper Structure

This paper contains 23 sections, 9 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Global low-rank structure of CaiT-S24 (Touvron et al., 2021). (a) Layer-wise effective dimension (minimal rank) required to recover $99\%$ of the feature energy for CaiT-S24 on ImageNet-1K, averaged over 1000 validation images. The required rank follows a clear hump across depth and is substantially below the channel width (384) in the last layers, indicating a globally low-rank representation. (b–e) Histograms of the minimal last-layer rank needed to recover $80\%$, $90\%$, $95\%$, and $99\%$ of the energy, respectively. The 99th-percentile ranks are 14, 34, 61, and 121, confirming that a single shared linear projector of modest dimension can preserve almost all last-layer energy for nearly all inputs.
  • Figure 2: Token-level Spectral Energy Pattern (SEP) across ViT architectures. Cumulative spectral energy of last-layer tokens as a function of normalized spectral bandwidth $d/D'$ for several Vision Transformers (ViT-Tiny, CaiT-S24, DeiT-Small, ViT-Large, ViT-Huge, Swin-Small), averaged over 1000 ImageNet-1K validation images. All models follow nearly identical, almost diagonal SEP curves: capturing $50\%$, $70\%$, or $90\%$ of a token's energy requires roughly $50\%$, $70\%$, or $90\%$ of the available frequency bins, respectively. This universality shows that individual tokens spread their energy broadly over channel modes, indicating high per-token utilization despite the globally low-rank structure revealed by SVD.
  • Figure 3: Singular value decomposition (SVD) analysis of DeiT-Small. (a) Layer-wise effective dimension required to recover $99\%$ of the feature energy (mean, ±1 std, and interquartile range). (b–e) Histograms of last-layer dimensions needed to restore different fractions of the energy ($80\%, 90\%, 95\%, 99\%$). The final layer is strongly low-rank compared to the full channel dimension.
  • Figure 4: SVD analysis of Swin-Small. (a) Stage-wise effective dimension required for $99\%$ energy restoration across the four Swin stages (mean, ±1 std, and interquartile range). (b–e) Histograms of the last stage’s effective dimensions at multiple energy thresholds. Despite its hierarchical, windowed design, the last-stage representation is also globally low-rank.
  • Figure 5: SVD analysis of ViT-Huge. (a) Layer-wise effective dimension required to preserve $99\%$ of the feature energy over $32$ transformer layers. (b–e) Histograms of last-layer dimensions for different energy thresholds, showing that only a small fraction of the $1280$ channels is actually used to capture almost all variance.
  • ...and 3 more figures
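
The cumulative-energy curve described in the Figure 2 caption can be sketched as follows. The captions here do not give the exact SEP definition, so this is only one plausible reading: it assumes each token's channel vector is transformed with a real FFT along the channel axis (giving $D' = D/2 + 1$ frequency bins for even $D$), that energy is the squared magnitude per bin, and that bins are accumulated in natural frequency order. The names `sep_curve` and `bandwidth_for` are hypothetical.

```python
import numpy as np


def sep_curve(tokens: np.ndarray) -> np.ndarray:
    """Mean cumulative spectral energy over frequency bins.

    tokens: (num_tokens, D) array of non-zero token vectors.
    Assumptions (not confirmed by the source): real FFT along the channel
    axis, squared-magnitude energy, bins accumulated from low to high
    frequency. Returns an array of length D' = D // 2 + 1.
    """
    spectrum = np.fft.rfft(tokens, axis=-1)
    power = np.abs(spectrum) ** 2
    cumulative = np.cumsum(power, axis=-1)
    # Normalize each token's curve so that it ends at 1.
    cumulative = cumulative / cumulative[:, -1:]
    return cumulative.mean(axis=0)


def bandwidth_for(curve: np.ndarray, energy: float) -> float:
    """Fraction d / D' of bins needed for the mean curve to reach `energy`."""
    d = int(np.searchsorted(curve, energy)) + 1
    return min(d, len(curve)) / len(curve)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # White-noise tokens spread energy evenly, so the curve is near-diagonal.
    noise_tokens = rng.standard_normal((256, 384))
    curve = sep_curve(noise_tokens)
    for level in (0.5, 0.7, 0.9):
        print(level, round(bandwidth_for(curve, level), 3))
```

Under these assumptions, a near-diagonal curve means a token needs roughly the same fraction of bins as the fraction of energy to be captured, which is the high-bandwidth pattern the caption describes; a token with energy concentrated in a few bins would instead produce a curve that saturates early.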