HDKD: Hybrid Data-Efficient Knowledge Distillation Network for Medical Image Classification

Omar S. EL-Assiouti, Ghada Hamed, Dina Khattab, Hala M. Ebied

TL;DR

This work tackles data scarcity in medical image classification with Vision Transformers by introducing Hybrid Data-Efficient Knowledge Distillation (HDKD), in which a CNN teacher distills both logits and intermediate features into a lightweight hybrid student that combines CNN inductive biases with transformer-based global processing. A novel Mobile Channel-Spatial Attention (MBCSA) block serves as the primary convolutional block in both teacher and student, so the two models share a convolutional structure and features can be distilled directly without alignment overhead, while a Distilled Feature-level Transformer (DFLT) handles the final global reasoning. The approach shows consistent improvements over non-distilled baselines and competitive, if not superior, performance against state-of-the-art models on the Brain Tumor MRI and HAM-10000 datasets, with particular strength when training data are limited and when deployment efficiency matters. These results demonstrate HDKD’s potential for robust, data-efficient medical image classification and, given its lightweight design, its suitability for edge devices.

Abstract

Vision Transformers (ViTs) have achieved significant advances in computer vision tasks due to their powerful modeling capacity. However, their performance degrades notably when they are trained with insufficient data, because they lack inherent inductive biases. Distilling knowledge and inductive biases from a Convolutional Neural Network (CNN) teacher has emerged as an effective strategy for enhancing the generalization of ViTs on limited datasets. Previous approaches to Knowledge Distillation (KD) have pursued two primary paths. Some focused solely on distilling the logit distribution from a CNN teacher to a ViT student, neglecting the rich semantic information present in intermediate features because of the structural differences between the two models. Others integrated feature distillation along with logit distillation, yet this introduced alignment operations that limit the amount of knowledge transferred due to mismatched architectures and that increase the computational overhead. To this end, this paper presents the Hybrid Data-Efficient Knowledge Distillation (HDKD) paradigm, which employs a CNN teacher and a hybrid student. The choice of a hybrid student serves two main purposes. First, it leverages the strengths of both convolutions and transformers while sharing the convolutional structure with the teacher model. Second, this shared structure enables the direct application of feature distillation without any information loss or additional computational overhead. Additionally, we propose an efficient lightweight convolutional block named Mobile Channel-Spatial Attention (MBCSA), which serves as the primary convolutional block in both teacher and student models. Extensive experiments on two public medical datasets demonstrate the superiority of HDKD over other state-of-the-art models, as well as its computational efficiency. Source code at: https://github.com/omarsherif200/HDKD
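
As a rough illustration of how logit and feature distillation can be combined into one training objective, the sketch below computes a per-sample loss in plain Python. It is not taken from the HDKD source code. It assumes a Hinton-style soft-label KL term for the logits and a mean-squared-error term for the features, and the names `distillation_loss`, `temperature`, `alpha`, and `beta` are hypothetical choices made for this example. Because teacher and student share the same convolutional structure, the sketch compares feature vectors element-wise with no projection or alignment layer in between.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_sum for z in scaled]


def cross_entropy(logits, label):
    """Standard cross-entropy against the ground-truth class index."""
    return -log_softmax(logits)[label]


def kl_divergence(teacher_log_probs, student_log_probs):
    """KL(teacher || student) computed from log-probabilities."""
    return sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(teacher_log_probs, student_log_probs)
    )


def feature_mse(student_feat, teacher_feat):
    """Mean squared error between same-shaped (flattened) feature maps."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature maps must have the same shape")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, label,
                      temperature=2.0, alpha=0.5, beta=1.0):
    """Assumed objective: (1 - alpha) * CE + alpha * T^2 * KL + beta * MSE.

    temperature, alpha and beta are illustrative hyperparameters,
    not values reported in the paper.
    """
    ce = cross_entropy(student_logits, label)
    kd = temperature ** 2 * kl_divergence(
        log_softmax(teacher_logits, temperature),
        log_softmax(student_logits, temperature),
    )
    feat = feature_mse(student_feat, teacher_feat)
    return (1.0 - alpha) * ce + alpha * kd + beta * feat


if __name__ == "__main__":
    # Toy three-class example with a four-element flattened feature map.
    loss = distillation_loss(
        student_logits=[1.0, 0.2, -0.5],
        teacher_logits=[2.0, 0.1, -1.0],
        student_feat=[0.1, 0.4, 0.0, 0.3],
        teacher_feat=[0.2, 0.5, 0.1, 0.3],
        label=0,
    )
    print(f"combined loss: {loss:.4f}")
```

When the student reproduces the teacher's logits and features exactly, both distillation terms vanish and only the weighted cross-entropy remains, which is a quick sanity check for any implementation of this kind.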

Paper Structure

This paper contains 22 sections, 8 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Comparison between the MBConv block and the MBCSA block. (a) The standard MobileNet block with an SE module. (b) The proposed MBCSA block, which replaces the SE module with a CBAM module to adjust the relevance of both spatial and channel information.
  • Figure 2: Overview of the proposed teacher-student (HDKD) paradigm. The teacher model is trained first, followed by the student model. During its training, the student leverages the knowledge distilled from the teacher through logit and feature distillation.
  • Figure 3: Comparison of the distilled student (HDKD) with its non-distilled version across various data sizes on the Brain Tumor MRI dataset.
  • Figure 4: Comparison of the distilled student (HDKD) with its non-distilled version across various data sizes on the HAM-10000 dataset.
  • Figure 5: Feature visualization analysis. We visualize the activation maps of the last block in the final convolutional stage (3rd stage) for the teacher model, the student model without distillation, and the HDKD model. These activation maps are obtained by averaging across the channel dimension, resulting in a single global channel of size $28 \times 28$.
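
The channel averaging described in the Figure 5 caption can be sketched in a few lines of dependency-free Python. This is an illustration only, not the authors' visualization code. The function names `channel_mean_map` and `min_max_normalize` are hypothetical, and the min-max rescaling to [0, 1] is an assumption added here to make the map displayable; the caption itself only specifies the channel average. A real stage-3 feature tensor would be shaped `[C][28][28]`, while the toy input below is `[2][2][2]`.

```python
def channel_mean_map(features):
    """Average a [C][H][W] nested list over the channel axis -> [H][W]."""
    num_channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    return [
        [
            sum(features[c][i][j] for c in range(num_channels)) / num_channels
            for j in range(width)
        ]
        for i in range(height)
    ]


def min_max_normalize(activation_map):
    """Rescale a 2-D map to [0, 1] for display (an assumed extra step)."""
    values = [v for row in activation_map for v in row]
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # A constant map carries no contrast; return all zeros.
        return [[0.0 for _ in row] for row in activation_map]
    return [[(v - lowest) / span for v in row] for row in activation_map]


if __name__ == "__main__":
    # Toy feature tensor: 2 channels, each 2x2.
    toy_features = [
        [[0.0, 2.0], [4.0, 6.0]],
        [[2.0, 2.0], [0.0, 2.0]],
    ]
    mean_map = channel_mean_map(toy_features)
    print(mean_map)  # [[1.0, 2.0], [2.0, 4.0]]
    print(min_max_normalize(mean_map))
```

Applying the same two steps to the teacher, the non-distilled student, and the distilled student would give single-channel maps that can be compared side by side, which is the kind of comparison the caption describes.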