
Efficient Few-Shot Learning for Edge AI via Knowledge Distillation on MobileViT

Shuhei Tsuyuki, Reda Bensaid, Jérémy Morlier, Mathieu Léonardon, Naoya Onizawa, Vincent Gripon, Takahiro Hanyu

Abstract

Efficient and adaptable models are an important focus of deep learning research, driven by the need for highly efficient models on edge devices. Few-shot learning enables the use of deep learning models in low-data regimes, a capability that is highly sought after in real-world applications where collecting large annotated datasets is costly or impractical. This challenge is particularly relevant in edge scenarios, where connectivity may be limited, low-latency responses are required, or energy consumption constraints are critical. We propose and evaluate a pre-training method for the MobileViT backbone designed for edge computing. Specifically, we employ knowledge distillation, which transfers the generalization ability of a large-scale teacher model to a lightweight student model. This method achieves accuracy improvements of 14% and 6.7% for one-shot and five-shot classification, respectively, on the MiniImageNet benchmark compared to the ResNet12 baseline, while reducing the number of parameters by 69% and the computational complexity (in FLOPs) by 88%. Furthermore, we deployed the proposed models on a Jetson Orin Nano platform and measured power consumption directly at the power supply, showing that dynamic energy consumption is reduced by 37% at a latency of 2.6 ms. These results demonstrate that the proposed method is a promising and practical solution for deploying few-shot learning models on edge AI hardware.
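To make the distillation objective mentioned in the abstract concrete, here is a minimal PyTorch sketch of standard temperature-scaled knowledge distillation between a frozen teacher and a lightweight student. This is not the paper's training code: the `temperature` and `alpha` hyperparameters, and the `teacher`/`student` module interfaces, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Standard KD loss (Hinton et al.): a soft-target KL term plus a
    hard-label cross-entropy term. `temperature` and `alpha` are assumed
    hyperparameters, not values reported in the paper."""
    # Soften both distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student soft targets; the T^2
    # factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def train_step(student, teacher, images, labels, optimizer):
    """One training step: the teacher is frozen, only the student is
    updated. `teacher` and `student` are assumed nn.Module classifiers."""
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```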



Figures (4)

  • Figure 1: Few-shot learning uses a small support set for training and a query set for testing. The number of classes is called the way, and the number of examples per class the shot (a minimal sketch of this episode protocol follows this list).
  • Figure 2: FSL consists of three steps: pre-training a backbone for feature extraction, learning a new task from a small amount of data with the pre-trained backbone, and evaluating classification performance on that task.
  • Figure 3: Overview of the proposed method: MobileViT-XXS is used as the backbone, and knowledge distillation is performed as pre-training. Afterwards, the backbone is frozen, and the model is trained and evaluated on a small number of samples, as in Figure 2.
  • Figure 4: Performance comparison of EfficientNet (B0–B3), MobileNetV3 (S, L), and MobileViT (S, XS, XXS) backbones in the first stage of the EASY pipeline without ensemble features or augmented shots. All models were trained on $84 \times 84$ images with batch size 376 for 100 epochs using a learning rate of 0.01 with $\gamma=0.1$. Accuracy after pre-training is reported as a proxy for backbone quality in few-shot learning.
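As referenced in the Figure 1 caption, the sketch below illustrates how one N-way K-shot episode can be evaluated on top of a frozen backbone. It uses a nearest-class-mean classifier on L2-normalized features; the backbone interface and the choice of NCM (rather than the paper's exact classifier) are assumptions, so treat this as a generic episode-evaluation sketch, not the paper's method.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate_episode(backbone, support_x, support_y, query_x, n_way):
    """Classify query samples of one N-way K-shot episode by nearest
    class mean in the frozen backbone's feature space. The NCM
    classifier is an assumed stand-in for the paper's classifier."""
    backbone.eval()
    # Extract and L2-normalize features for support and query sets.
    support_f = F.normalize(backbone(support_x), dim=-1)   # (N*K, D)
    query_f = F.normalize(backbone(query_x), dim=-1)       # (Q, D)
    # One prototype (centroid) per class from the support features.
    prototypes = torch.stack(
        [support_f[support_y == c].mean(dim=0) for c in range(n_way)]
    )                                                      # (N, D)
    # Assign each query to the closest prototype (cosine similarity).
    logits = query_f @ prototypes.t()                      # (Q, N)
    return logits.argmax(dim=-1)
```

Because the backbone is frozen after pre-training (as in Figure 3), only feature extraction and the prototype computation run at adaptation time, which is what keeps per-episode cost low on edge hardware.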