Optimizing Vision Transformers with Data-Free Knowledge Transfer

Gousia Habib, Damandeep Singh, Ishfaq Ahmad Malik, Brejesh Lall

TL;DR

This work addresses deploying Vision Transformers (ViTs) on resource-constrained devices by proposing data-free knowledge distillation (DFKD) to compress large ViTs without access to original data. It introduces Transformer-augmented GANs that leverage patch-level attention and attention probes to generate high-quality synthetic data, guided by an attention-consistency loss and an overall generator loss $L_G = L_{\text{adv}} + \lambda L_{\text{attention}}$. For classification, the method combines $L_{\text{KD}}$, $L_{\text{CE}}$, and $L_{\text{patch}}$ into $L_{\text{total}} = \lambda_{\text{KD}} L_{\text{KD}} + \lambda_{\text{CE}} L_{\text{CE}} + \lambda_{\text{patch}} L_{\text{patch}}$, enabling effective distillation without original data across MNIST and CIFAR-10, with competitive accuracy and significant model compression. The approach extends to object detection by distilling DETR-based teachers to lighter DETR-based students using classification, bounding-box, and distillation losses, achieving practical performance on a drone detection dataset. Overall, the paper demonstrates that data-free, transformer-aware distillation can enable robust ViT deployment on edge devices while preserving core recognition capabilities in both classification and detection tasks.
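
To make the weighted classification objective concrete, the following is a minimal sketch of how the three terms could be combined for a single sample. Only the weighted sum itself comes from the summary above; the individual term definitions are assumptions for illustration: `kd_loss` uses the common temperature-softened KL divergence, `ce_loss` is standard cross-entropy, and `patch_loss` is a hypothetical mean squared error between teacher and student per-patch features. All function names, the temperature, and the default weights are illustrative and not taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]


def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Assumed form of L_KD: KL(teacher || student) on softened outputs."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # The T^2 factor is the usual gradient-scale correction for softened targets.
    return kl * temperature ** 2


def ce_loss(student_logits, label):
    """L_CE: cross-entropy of the student prediction against a class label."""
    return -log_softmax(student_logits)[label]


def patch_loss(teacher_patches, student_patches):
    """Hypothetical form of L_patch: MSE over per-patch feature vectors."""
    squared = [
        (t - s) ** 2
        for t_vec, s_vec in zip(teacher_patches, student_patches)
        for t, s in zip(t_vec, s_vec)
    ]
    return sum(squared) / len(squared)


def total_loss(teacher_logits, student_logits, label,
               teacher_patches, student_patches,
               lam_kd=1.0, lam_ce=1.0, lam_patch=1.0):
    """L_total = lam_kd * L_KD + lam_ce * L_CE + lam_patch * L_patch."""
    return (lam_kd * kd_loss(teacher_logits, student_logits)
            + lam_ce * ce_loss(student_logits, label)
            + lam_patch * patch_loss(teacher_patches, student_patches))
```

In a data-free setting the inputs would be generator-produced images rather than real samples, so the label passed to `ce_loss` would have to come from the generator's conditioning or the teacher's prediction; which of these the authors use is not stated in this summary.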

Abstract

The groundbreaking performance of transformers in Natural Language Processing (NLP) tasks has led them to replace traditional Convolutional Neural Networks (CNNs), owing to the efficiency and accuracy achieved through the self-attention mechanism. This success has inspired researchers to explore the use of transformers in computer vision tasks to attain enhanced long-term semantic awareness. Vision transformers (ViTs) have excelled in various computer vision tasks due to their superior ability to capture long-distance dependencies using the self-attention mechanism. Contemporary ViTs like Data Efficient Transformers (DeiT) can effectively learn both global semantic information and local texture information from images, achieving performance comparable to traditional CNNs. However, their impressive performance comes with a high computational cost due to their very large number of parameters, hindering their deployment on devices with limited resources such as smartphones, cameras, and drones. Additionally, ViTs require a large amount of training data to achieve performance comparable to benchmark CNN models. We therefore identify two key challenges in deploying ViTs on smaller form factor devices: the high computational requirements of large models and the need for extensive training data. As a solution to these challenges, we propose compressing large ViT models using Knowledge Distillation (KD), which is implemented data-free to circumvent limitations related to data availability. We also conducted experiments on object detection, in addition to classification, within the same environment. Based on our analysis, we found that data-free knowledge distillation is an effective method to overcome both issues, enabling the deployment of ViTs on resource-constrained devices.

Paper Structure (23 sections, 17 equations, 19 figures, 5 tables, 1 algorithm)


Figures (19)

  • Figure 1: Image, Attention Map and Corresponding Attention Probes
  • Figure 2: Class Attention Probe
  • Figure 3: Class Attention Probe-MNIST
  • Figure 4: Class Attention Probe-CIFAR10
  • Figure 5: Proposed Transformer-Augmented GAN Framework
  • ...and 14 more figures