SugarViT -- Multi-objective Regression of UAV Images with Vision Transformers and Deep Label Distribution Learning Demonstrated on Disease Severity Prediction in Sugar Beet

Maurice Günder; Facundo Ramón Ispizua Yamati; Abel Andree Barreto Alcántara; Anne-Katrin Mahlein; Rafet Sifa; Christian Bauckhage

TL;DR

This work introduces SugarViT, a Vision Transformer-based framework for multi-objective regression on UAV multispectral imagery, demonstrated on predicting the severity of Cercospora Leaf Spot (CLS) in sugar beet. It uses Deep Label Distribution Learning (DLDL) to model label uncertainty. The model combines a shared ViT backbone with an MLP neck, multiple LDL heads, and a Feature Mixing stage to produce one probability distribution per label. Pretraining on the environmental metadata GDD and NPG accelerates convergence and improves robustness. A novel LDL loss composed of three terms, L_ld, L_exp, and L_smooth, enables scale-invariant, uncertainty-aware training across multiple phenological targets. Field-ready outputs, interpretability via attention maps, and GIS-friendly exports demonstrate SugarViT's practical potential for scalable, data-efficient UAV-based plant phenotyping and disease management, with possible extension to broader tasks such as disease spread modeling.
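To make the label distribution idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a Gaussian-shaped target distribution over integer severity bins 0 to 10 (the scale described in the Figure 3 caption), a hypothetical width `sigma`, a Kullback-Leibler term in the direction KL(target || predicted), and an absolute-error term on the expectation value. The exact definitions of L_ld and L_exp are not given in this summary, so these forms are assumptions, and L_smooth is not sketched at all.

```python
import math

def gaussian_label_distribution(label, bins, sigma):
    """Discretized Gaussian centered on `label`, normalized to sum to 1.
    The Gaussian shape and `sigma` are assumptions for illustration."""
    weights = [math.exp(-0.5 * ((b - label) / sigma) ** 2) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Turn raw head outputs into a probability distribution over bins."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), summed over all bins."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted))

def expectation(dist, bins):
    """Expectation value of a discrete distribution over `bins`."""
    return sum(p * b for p, b in zip(dist, bins))

# Severity bins 0..10; the label value and sigma are hypothetical.
bins = list(range(11))
label = 4.0
target = gaussian_label_distribution(label, bins, sigma=1.0)

# Hypothetical output of an untrained head: all logits equal (uniform).
predicted = softmax([0.0] * len(bins))

loss_ld = kl_divergence(target, predicted)           # distribution term
loss_exp = abs(expectation(predicted, bins) - label)  # expectation term
```

In this sketch the scalar prediction is the expectation value of the predicted distribution, which matches how the Figure 5 caption describes the final output.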

Abstract

Remote sensing and artificial intelligence are pivotal technologies in precision agriculture. The efficient retrieval of large-scale field imagery combined with machine learning techniques has shown success in various tasks such as phenotyping, weeding, cropping, and disease control. This work introduces a machine learning framework for automated large-scale plant-specific trait annotation, using disease severity scoring for Cercospora Leaf Spot (CLS) in sugar beet as the use case. With concepts of Deep Label Distribution Learning (DLDL), special loss functions, and a tailored model architecture, we develop an efficient Vision Transformer-based model for disease severity scoring called SugarViT. One novelty of this work is the combination of remote sensing data with environmental parameters of the experimental sites for disease severity prediction. Although the model is evaluated on this specific use case, it is kept as generic as possible so that it is also applicable to various image-based classification and regression tasks. Our framework even allows training models on multi-objective problems, as we show with a pretraining on environmental metadata.

Paper Structure (32 sections, 11 equations, 12 figures, 10 tables)

Introduction
Materials and Methods
Data and Preprocessing
Available Field Data
Image Normalization
Data Augmentation
Use Case: Disease Severity Estimation
Deep Label Distribution Learning
Full Kullback-Leibler Divergence Loss
Multi-Head Regression
Model Architecture
Vision Transformer Backbone
MLP Neck
LDL Heads
Feature Mixing
...and 17 more sections

Figures (12)

Figure 1: Histograms of available labels for DS, NPG, and GDD separated by train/validation and test data.
Figure 2: Example images shown by their separate channel components and processed with total and channel-wise standardization, respectively.
Figure 3: Disease severity scale used for our prediction model, with example images. The scale is based on the usual CLS rating scale. We added 0 for non-infested sugar beets before canopy closure and 10 for newly sprouted plants, as in facu_sugarindustry.
Figure 4: Sketch of our proposed Multi Deep Label Distribution Learning (Multi-DLDL) network with a ViT backbone. The LDL heads are trained with separate optimizers and loss functions. The ViT and MLP parts form the joint basis and are trained in each backward pass of the LDL heads. As output of the ViT, the last hidden state of the learnable class token is used. Furthermore, our use case is illustrated with multispectral plant image data and two training stages. The pretraining is done on the environmental, field-related quantities GDD and NPG. The target label DS is trained in the subsequent fine-tuning stage. In principle, the model can be generalized to more labels in each training stage by adding more LDL heads.
Figure 5: Output of SugarViT. The DS labels are learned as label distributions (green curves). SugarViT in turn outputs probability distributions (blue curves). The final prediction is the expectation value of the output distribution (dashed lines).
...and 7 more figures
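
The difference between the total and channel-wise standardization named in the Figure 2 caption can be illustrated with a short sketch. It is an assumption-laden toy example, not the paper's preprocessing code: the image is a hypothetical two-channel list of flattened pixel values, the function names are made up for illustration, and the population standard deviation is used.

```python
from statistics import fmean, pstdev

def standardize_total(image):
    """Standardize with one mean/std computed over all channels together.
    Relative intensity differences between channels are preserved."""
    values = [v for channel in image for v in channel]
    mean, std = fmean(values), pstdev(values)
    return [[(v - mean) / std for v in channel] for channel in image]

def standardize_channelwise(image):
    """Standardize each channel with its own mean/std.
    Every channel ends up with zero mean and unit variance."""
    out = []
    for channel in image:
        mean, std = fmean(channel), pstdev(channel)
        out.append([(v - mean) / std for v in channel])
    return out

# Hypothetical 2-channel image, each channel given as flattened pixels.
# A constant channel would have std 0 and need a guard; omitted for brevity.
image = [
    [0.10, 0.20, 0.30, 0.40],  # darker channel
    [0.60, 0.70, 0.80, 0.90],  # brighter channel
]

total = standardize_total(image)
per_channel = standardize_channelwise(image)
```

With total standardization the darker channel keeps a negative mean and the brighter one a positive mean, whereas channel-wise standardization centers both at zero.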
