Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT

Stanley Mugisha; Rashid Kisitu; Florence Tushabe

TL;DR

This work tackles the accuracy-efficiency conflict in agricultural IoT by introducing a hybrid knowledge distillation framework that transfers both logit information and attention maps from a high-performing Swin Transformer teacher to a lightweight MobileNetV3 student. By incorporating adaptive attention alignment and a dual-loss objective, the approach resolves cross-architecture mismatches and preserves both global class relationships and local spatial focus. The method achieves near-teacher performance while meeting stringent edge-device constraints, demonstrated through IoT-centric benchmarks on smartphone and Raspberry Pi hardware, and validated on tomato disease datasets. The results indicate significant potential for real-time, energy-efficient crop monitoring, with open-source code and deployment-ready models to facilitate wider adoption and future multi-modal extensions in precision agriculture.

Abstract

Integrating deep learning into agricultural IoT systems requires balancing the high accuracy of Vision Transformers (ViTs) against the efficiency demands of resource-constrained edge devices. Large transformer models such as Swin Transformers excel at plant disease classification by capturing global-local dependencies, but their computational cost (34.1 GFLOPs) makes them impractical for real-time on-device inference. Lightweight models such as MobileNetV3 and TinyML are suited to on-device inference but lack the spatial reasoning needed for fine-grained disease detection. To bridge this gap, we propose a hybrid knowledge distillation framework that jointly transfers logit and attention knowledge from a Swin Transformer teacher to a MobileNetV3 student. Our method introduces adaptive attention alignment to resolve cross-architecture mismatches (resolution, channels) and a dual-loss function that optimizes both class probabilities and spatial focus. On the PlantVillage-Tomato dataset (18,160 images), the distilled MobileNetV3 attains 92.4% accuracy, compared with 95.9% for Swin-L, while reducing inference latency by 95% on PC and by <82% on IoT devices (23 ms on a PC CPU and 86 ms/image on smartphone CPUs). Key innovations include IoT-centric validation metrics (13 MB memory, 0.22 GFLOPs) and dynamic resolution-matching attention maps. Comparative experiments show significant improvements over standalone CNNs and prior distillation methods, with a 3.5% accuracy gain over MobileNetV3 baselines. This work advances real-time, energy-efficient crop monitoring in precision agriculture and demonstrates how ViT-level diagnostic precision can be attained on edge devices. Code and models will be made available for replication after acceptance.
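To make the dual-loss idea concrete, the following is a minimal, dependency-free sketch of how a logit term and an attention term might be combined. It is not the authors' implementation: the function names, the temperature `temperature=4.0`, the weight `alpha`, the use of KL divergence for logits and mean squared error for attention, and average pooling as the resolution-matching step are all assumptions made for illustration. It works on single-channel square maps only, so the channel mismatch mentioned in the abstract is not handled here.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return kl * temperature ** 2

def avg_pool_to(attn, out_size):
    """Average-pool a square 2D map to out_size x out_size.

    Assumes len(attn) is a multiple of out_size.
    """
    k = len(attn) // out_size
    return [
        [
            sum(attn[i * k + di][j * k + dj] for di in range(k) for dj in range(k))
            / (k * k)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]

def normalize(attn):
    """Scale a 2D map so that its entries sum to 1."""
    total = sum(sum(row) for row in attn)
    return [[v / total for v in row] for row in attn]

def attention_loss(student_attn, teacher_attn):
    """MSE between normalized maps after pooling both to the coarser size."""
    size = min(len(student_attn), len(teacher_attn))
    s = normalize(avg_pool_to(student_attn, size))
    t = normalize(avg_pool_to(teacher_attn, size))
    sq_err = sum((a - b) ** 2 for rs, rt in zip(s, t) for a, b in zip(rs, rt))
    return sq_err / (size * size)

def hybrid_loss(student_logits, teacher_logits, student_attn, teacher_attn,
                alpha=0.5, temperature=4.0):
    """Weighted sum of the logit term and the attention term (hypothetical weighting)."""
    return (alpha * logit_kd_loss(student_logits, teacher_logits, temperature)
            + (1 - alpha) * attention_loss(student_attn, teacher_attn))

# Toy example: a 4x4 teacher map against a 2x2 student map.
teacher_map = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
student_map = [[1, 1], [1, 1]]
loss = hybrid_loss([2.0, 0.5, 0.1], [3.0, 0.2, 0.1], student_map, teacher_map)
```

In a real training loop, the same structure would typically be written with a tensor library and combined with a supervised cross-entropy term on the ground-truth labels; whether and how the paper weights those terms is not stated in the abstract.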
