DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

Harsh Rangwani, Pradipto Mondal, Mayank Mishra, Ashish Ramayee Asokan, R. Venkatesh Babu

TL;DR

DeiT-LT introduces a distillation-based framework to train Vision Transformers from scratch on long-tailed datasets by leveraging CNN priors through out-of-distribution (OOD) distillation and by enforcing low-rank, generalizable features via SAM-trained teachers. The model implements a dual-expert token setup (CLS for head, DIST for tail) and uses Deferred Re-Weighting to emphasize tail classes in the distillation loss, achieving CNN-like locality in early ViT layers and improved tail-class generalization. Across CIFAR-10/100 LT, ImageNet-LT, and iNaturalist-2018, DeiT-LT delivers substantial gains over strong baselines, including near-SotA performance on small datasets without pretraining. The work demonstrates that ViTs can reach strong long-tailed performance by distilling from CNNs and by guiding representation learning with low-rank, tail-focused features, broadening ViT applicability across domains without reliance on large-scale pretraining.

Abstract

Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNNs), ViT's simple architecture has no informative inductive bias (e.g., locality). Due to this, ViT requires a large amount of data for pre-training. Various data-efficient approaches (DeiT) have been proposed to train ViT on balanced datasets effectively. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distilling from a CNN via the distillation (DIST) token, by using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to learning low-rank generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the distillation (DIST) token becomes an expert on the tail classes, and the classifier (CLS) token becomes an expert on the head classes. The experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.
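
The two-token objective described in the abstract can be sketched for a single sample as follows. This is a minimal illustration, not the authors' implementation: the function names, the equal 0.5/0.5 weighting of the two terms, the effective-number form of the class weights, and the value of `beta` are all assumptions made for the example. Mixup and the strong augmentations that produce the out-of-distribution inputs are omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target, weight=1.0):
    """Weighted cross-entropy of one sample against an integer target."""
    return -weight * math.log(softmax(logits)[target])


def drw_weights(class_counts, epoch, defer_epoch, beta=0.9999):
    """Deferred re-weighting (assumed scheme): uniform weights before
    defer_epoch, inverse effective-number weights afterwards."""
    if epoch < defer_epoch:
        return [1.0] * len(class_counts)
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(raw) / sum(raw)  # normalize so the weights average to 1
    return [w * scale for w in raw]


def deit_lt_loss(cls_logits, dist_logits, teacher_logits, label, weights):
    """CLS head: plain CE against the ground-truth label.
    DIST head: re-weighted CE against the teacher's hard prediction."""
    hard_target = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    head_term = cross_entropy(cls_logits, label)
    tail_term = cross_entropy(dist_logits, hard_target, weights[hard_target])
    return 0.5 * head_term + 0.5 * tail_term  # equal weighting is an assumption


# Hypothetical two-class problem: 5000 head images, 50 tail images.
counts = [5000, 50]
weights = drw_weights(counts, epoch=10, defer_epoch=5)
loss = deit_lt_loss([2.0, 0.5], [0.1, 1.5], [0.3, 2.2], label=0, weights=weights)
```

Because the re-weighting touches only the DIST term, the CLS token keeps fitting the natural (head-dominated) label distribution while the DIST token is pushed toward the tail classes, which is the division of labor the abstract describes.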

Paper Structure (24 sections, 5 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: We propose DeiT-LT (Fig. 2), a distillation scheme for Vision Transformer (ViT) tailored to long-tailed data. In DeiT-LT, a) we introduce OOD distillation from a CNN, which leads to learning local, generalizable features in early blocks; b) we propose to distill from teachers trained via SAM (Foret et al., 2020), which induces low-rank features across blocks in ViT to improve generalization; c) in comparison to other SotA ViT baselines, DeiT-LT (ours) demonstrates significantly improved performance for minority classes as the imbalance increases.
  • Figure 2: Overview of DeiT-LT. The Head Expert classifier trains using CE loss against ground truth, whereas the Tail Expert classifier trains using DRW loss against hard-distillation targets from the flat ResNet teacher trained via SAM (Foret et al., 2020). The distillation is performed using out-of-distribution images created using strong augmentations and Mixup.
  • Figure 3: Entropy of teacher outputs: Comparison of the entropy of in-distribution samples and out-of-distribution samples with the ResNet-32 teacher on CIFAR-10 LT. The table referenced as tab:augs reports higher accuracy for the out-of-distribution samples.
  • Figure 4: Effect of Distillation in DeiT-LT. In a) we train DeiT-B with teachers trained on in-distribution images (RegNetY-16GF) and out-of-distribution images (ResNet-32). The out-of-distribution distillation leads to diverse experts, which become more diverse with deferred re-weighting on the distillation token (DRW). In b) we plot the Mean Attention Distance for the patches across the early self-attention block 1 (solid) and block 2 (dashed) for baselines, where we find that DeiT-LT leads to highly local and generalizable features. In c) we show the rank of features for the DIST token, where we demonstrate that students trained with SAM are more low-rank in comparison to baselines.
  • Figure 5: Visual comparison of the attention maps with respect to the CLS and DIST tokens for tail images from the ImageNet-LT dataset. The attention maps are computed by Attention Rollout (Abnar and Zuidema, 2020).
  • ...and 5 more figures
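
The quantity compared in Figure 3, the entropy of the teacher's predictive distribution, can be computed as in the sketch below. It is illustrative only: the helper names and the example logits are made up for this example and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_teacher_entropy(batch_logits):
    """Average predictive entropy of the teacher over a batch of logit vectors."""
    return sum(entropy(softmax(l)) for l in batch_logits) / len(batch_logits)


# Hypothetical teacher logits: one peaked batch and one flatter batch.
peaked_batch = [[8.0, 0.5, 0.1], [7.5, 0.2, 0.3]]
flat_batch = [[1.2, 0.9, 1.0], [0.8, 1.1, 1.0]]
print(mean_teacher_entropy(peaked_batch), mean_teacher_entropy(flat_batch))
```

A peaked prediction gives entropy close to zero, while a prediction spread evenly over K classes gives the maximum value log K, so the mean over a batch summarizes how confident the teacher is on that set of inputs.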