SugarViT -- Multi-objective Regression of UAV Images with Vision Transformers and Deep Label Distribution Learning Demonstrated on Disease Severity Prediction in Sugar Beet
Maurice Günder, Facundo Ramón Ispizua Yamati, Abel Andree Barreto Alcántara, Anne-Katrin Mahlein, Rafet Sifa, Christian Bauckhage
TL;DR
This work introduces SugarViT, a Vision Transformer–based framework for multi-objective regression on UAV multispectral imagery that predicts the severity of Cercospora Leaf Spot (CLS) in sugar beet, using Deep Label Distribution Learning to model label uncertainty. The model combines a shared ViT backbone with an MLP neck, multiple LDL heads, and a Feature Mixing stage to produce per-label probability distributions. Pretraining on environmental metadata (GDD and NPG) accelerates convergence and improves robustness. A novel LDL loss composed of three terms (L_ld, L_exp, and L_smooth) enables scale-invariant, uncertainty-aware training across multiple phenological targets. Field-ready outputs, interpretability via attention maps, and GIS-friendly exports demonstrate SugarViT’s practical potential for scalable, data-efficient UAV-based plant phenotyping and disease management, with possible extension to broader tasks such as disease spread modeling.
Abstract
Remote sensing and artificial intelligence are pivotal technologies in today's precision agriculture. The efficient retrieval of large-scale field imagery, combined with machine learning techniques, has proven successful in tasks such as phenotyping, weeding, cropping, and disease control. This work introduces a machine learning framework for automated, large-scale, plant-specific trait annotation, applied to the use case of disease severity scoring for Cercospora Leaf Spot (CLS) in sugar beet. Using concepts from Deep Label Distribution Learning (DLDL), special loss functions, and a tailored model architecture, we develop an efficient Vision Transformer-based model for disease severity scoring called SugarViT. One novelty of this work is the combination of remote sensing data with environmental parameters of the experimental sites for disease severity prediction. Although the model is evaluated on this specific use case, it is kept as generic as possible so that it also applies to a wide range of image-based classification and regression tasks. With our framework, it is even possible to train models on multi-objective problems, as we show by pretraining on environmental metadata.
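To make the label distribution idea concrete, the following is a minimal NumPy sketch of how a scalar severity score could be turned into a discrete label distribution and compared with a model output. It is not the SugarViT implementation: the 1 to 10 rating scale, the Gaussian shape with `sigma=1.0`, the Kullback-Leibler form of the distribution loss, the absolute-error form of the expectation loss, and all function names are assumptions chosen for illustration. The smoothness term is omitted because its definition is not given in this section.

```python
import numpy as np

def label_distribution(y, bins, sigma):
    """Discretized Gaussian centred on the scalar label y, normalized to sum to 1."""
    logits = -0.5 * ((bins - y) / sigma) ** 2
    p = np.exp(logits - logits.max())
    return p / p.sum()

def softmax(z):
    """Numerically stable softmax over a 1-D array of head outputs."""
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions on the same bins."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def expectation(q, bins):
    """Expected label value under the distribution q."""
    return float(np.dot(q, bins))

# Hypothetical rating scale and annotation (assumptions, not from the paper).
bins = np.arange(1, 11, dtype=float)
y_true = 4.0
target = label_distribution(y_true, bins, sigma=1.0)

# Stand-in for an untrained head: all-zero logits give a uniform prediction.
pred = softmax(np.zeros_like(bins))

loss_ld = kl_divergence(target, pred)              # distribution-matching term
loss_exp = abs(expectation(pred, bins) - y_true)   # expectation (regression) term
```

In this sketch the distribution term penalizes the shape mismatch between the predicted and target distributions, while the expectation term penalizes the distance between the predicted mean score and the annotated score; how the actual model weights and combines such terms is described in the paper's method section rather than here.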
