Towards Optimal Trade-offs in Knowledge Distillation for CNNs and Vision Transformers at the Edge

John Violos; Symeon Papadopoulos; Ioannis Kompatsiaris

TL;DR

This work tackles the practical challenge of knowledge distillation (KD) for CNNs and Vision Transformers (ViTs) in edge environments with strict resource limits. It studies four facets: architecture mismatch between teacher and student, student capacity, input resolution, and post-KD fine-tuning. Experiments cover CNN and transformer models (including VGG, DeiT, Swin, and ViT) on CIFAR-10/100 and ImageNet-1k. Key findings: CNN-to-CNN KD is the fastest and most efficient configuration on edge devices, while transformers require longer training; higher image resolution benefits small students but can hurt large ones; and fine-tuning after KD yields substantial gains for transformer-based students. The results offer practitioners guidance on balancing accuracy and efficiency in edge KD, and the authors propose a meta-heuristic approach as future work to optimize the search over architectures and KD configurations under resource constraints.

Abstract

This paper discusses four facets of the Knowledge Distillation (KD) process for Convolutional Neural Network (CNN) and Vision Transformer (ViT) architectures, particularly when executed on edge devices with constrained processing capabilities. First, we conduct a comparative analysis of the KD process between CNN and ViT architectures, aiming to elucidate the feasibility and efficacy of employing different architectural configurations for the teacher and student, while assessing their performance and efficiency. Second, we explore the impact of varying the size of the student model on accuracy and inference speed, while maintaining a constant KD duration. Third, we examine the effects of employing higher-resolution images on accuracy, memory footprint, and computational workload. Last, we examine the performance improvements obtained by fine-tuning the student model on specific downstream tasks after KD. Through empirical evaluations and analyses, this research provides AI practitioners with insights into optimal strategies for maximizing the effectiveness of the KD process on edge devices.
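For readers unfamiliar with the teacher-student setup the abstract refers to, the following is a minimal sketch of one common KD objective, the temperature-softened soft-target loss. It is an assumption for illustration only: the text above does not state which loss the authors use, and the function names `softmax` and `kd_loss`, as well as the `temperature` and `alpha` defaults, are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Soft-target KD loss for one sample (assumed formulation, not the paper's).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL divergence between the softened teacher and student distributions.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-target term: KL(teacher || student) at the given temperature.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(q_teacher, q_student)
        if t > 0
    )

    # The temperature^2 factor keeps the soft-target term on a comparable scale.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl
```

In this sketch the soft-target term vanishes when the student reproduces the teacher's logits exactly, so the loss reduces to the weighted cross-entropy; a smaller student trained this way is the kind of model the abstract evaluates for accuracy and inference speed.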
