Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT

Stanley Mugisha, Rashid Kisitu, Florence Tushabe

TL;DR

This work tackles the accuracy-efficiency conflict in agricultural IoT by introducing a hybrid knowledge distillation framework that transfers both logit information and attention maps from a high-performing Swin Transformer teacher to a lightweight MobileNetV3 student. By incorporating adaptive attention alignment and a dual-loss objective, the approach resolves cross-architecture mismatches and preserves both global class relationships and local spatial focus. The method achieves near-teacher performance while meeting stringent edge-device constraints, demonstrated through IoT-centric benchmarks on smartphone and Raspberry Pi hardware, and validated on tomato disease datasets. The results indicate significant potential for real-time, energy-efficient crop monitoring, with open-source code and deployment-ready models to facilitate wider adoption and future multi-modal extensions in precision agriculture.

Abstract

Integrating deep learning applications into agricultural IoT systems faces a serious challenge: balancing the high accuracy of Vision Transformers (ViTs) with the efficiency demands of resource-constrained edge devices. Large transformer models such as the Swin Transformer excel in plant disease classification by capturing global-local dependencies. However, their computational complexity (34.1 GFLOPs) makes them impractical for real-time on-device inference. Lightweight models such as MobileNetV3 and TinyML approaches are suitable for on-device inference but lack the spatial reasoning required for fine-grained disease detection. To bridge this gap, we propose a hybrid knowledge distillation framework that jointly transfers logit and attention knowledge from a Swin Transformer teacher to a MobileNetV3 student model. Our method introduces adaptive attention alignment to resolve cross-architecture mismatches (resolution, channels) and a dual-loss function that optimizes both class probabilities and spatial focus. On the PlantVillage-Tomato dataset (18,160 images), the distilled MobileNetV3 attains 92.4% accuracy, compared with 95.9% for Swin-L, but with a 95% reduction on PC and < 82% on IoT devices in inference latency (23 ms on a PC CPU and 86 ms/image on smartphone CPUs). Key innovations include IoT-centric validation metrics (13 MB memory, 0.22 GFLOPs) and dynamic resolution-matching attention maps. Comparative experiments show significant improvements over standalone CNNs and prior distillation methods, with a 3.5% accuracy gain over MobileNetV3 baselines. Overall, this work advances real-time, energy-efficient crop monitoring in precision agriculture and demonstrates how ViT-level diagnostic precision can be attained on edge devices. Code and models will be made available for replication after acceptance.
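
To make the dual-loss idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a temperature-softened KL term for the logits and an attention term computed from channel-pooled, L2-normalised spatial maps (in the style of activation-based attention transfer), with the teacher map resized to the student's resolution by nearest-neighbour sampling. The function names, the weights `alpha` and `beta`, the temperature, and the pooling and resizing choices are all illustrative assumptions; the abstract does not specify them, and the paper's adaptive channel projection may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened probabilities, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def spatial_attention(features):
    """Collapse a C x H x W feature volume into an H x W map by summing
    squared activations over channels (assumed channel-agnostic pooling)."""
    h, w = len(features[0]), len(features[0][0])
    return [[sum(ch[i][j] ** 2 for ch in features) for j in range(w)]
            for i in range(h)]

def resize_nearest(attn, out_h, out_w):
    """Nearest-neighbour resize of a 2D map given as a list of rows."""
    in_h, in_w = len(attn), len(attn[0])
    return [[attn[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def l2_normalize(attn):
    """Flatten a 2D map and scale it to unit L2 norm."""
    flat = [v for row in attn for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

def attention_loss(student_feats, teacher_feats):
    """Mean squared error between normalised attention maps, after matching
    the teacher map to the student's spatial resolution."""
    s_map = spatial_attention(student_feats)
    t_map = spatial_attention(teacher_feats)
    t_map = resize_nearest(t_map, len(s_map), len(s_map[0]))
    s, t = l2_normalize(s_map), l2_normalize(t_map)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

def hybrid_loss(student_logits, teacher_logits, student_feats, teacher_feats,
                alpha=0.7, beta=0.3, temperature=4.0):
    """Weighted sum of the logit and attention terms (hypothetical weights)."""
    kd = logit_kd_loss(student_logits, teacher_logits, temperature)
    at = attention_loss(student_feats, teacher_feats)
    return alpha * kd + beta * at

if __name__ == "__main__":
    # Toy inputs: 3 classes; student has 1 channel at 2x2, teacher 2 channels at 4x4.
    s_logits, t_logits = [1.0, 0.5, -0.2], [2.0, 0.1, -1.0]
    s_feats = [[[0.2, 0.8], [0.1, 0.4]]]
    t_feats = [[[0.1 * (i + j) for j in range(4)] for i in range(4)] for _ in range(2)]
    print(hybrid_loss(s_logits, t_logits, s_feats, t_feats))
```

Pooling over channels before comparison sidesteps the channel-count mismatch between the two architectures without learned parameters, which keeps the sketch short. In practice a hard-label cross-entropy term is usually added to such an objective; it is left out here because the abstract describes only the two distillation terms.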

Paper Structure

This paper contains 45 sections, 7 equations, 4 figures, 10 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Hybrid distillation framework addressing cross-architecture challenges through (A) multi-scale attention alignment and (B) adaptive channel projection. Green arrows indicate IoT-optimized components.
  • Figure 2: Mobile application for IoT-driven tomato disease detection. The distilled MobileNetV3 model achieves energy-efficient, real-time inference (<100 ms) on edge devices, enabling farmers to diagnose diseases (e.g., 'Tomato Bacterial Spot') via upload or camera capture. This supports scalable precision agriculture in resource-constrained environments. In the background, data can optionally be uploaded to a cloud server for further analysis.
  • Figure 3: Raspberry Pi testbed setup for evaluating the models. The proposed model's real-time inference (<25 ms) and small memory footprint (<80 MB) make it suitable for on-device deployment.
  • Figure 4: Deployment pipeline for the distilled MobileNetV3 model in IoT-driven smart agriculture. Farmers capture or upload crop images via a mobile app or a Raspberry Pi; the images then undergo preprocessing and on-device inference for real-time disease diagnosis.