On Convolutional Vision Transformers for Yield Prediction

Alvin Inderka; Florian Huber; Volker Steinhage

TL;DR

The paper investigates whether the Convolutional vision Transformer (CvT) can be applied to yield prediction from histogram-encoded MODIS data, and compares it against CNNs and XGBoost on soybean yields. CvT is adapted for regression on histogram-encoded inputs, trained with a mean-squared-error loss, scored with standard regression metrics (RMSE, $R^2$), and evaluated on a large US dataset with test years 2018–2021. The tested CvT configurations lag behind CNN and XGBoost in both end-of-year and in-year yield prediction. The study nonetheless offers insight into the applicability of Transformers and points to potential gains from larger datasets or new pretraining strategies. It highlights the importance of local feature extraction for this task and suggests avenues such as Swin Transformers and improved pre-processing and tokenization to unlock Transformer advantages in remote-sensing yield prediction.
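The two regression metrics named above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `rmse` and `r2_score` and the example yield values are assumptions chosen for illustration, and only the standard definitions of the metrics are used.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between observed and predicted yields."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical yields for four regions, for illustration only.
    observed = [48.0, 52.5, 39.0, 45.5]
    predicted = [47.0, 50.0, 41.5, 46.0]
    print("RMSE:", rmse(observed, predicted))
    print("R^2:", r2_score(observed, predicted))
```

A lower RMSE and a higher $R^2$ indicate a better fit. A model that always predicts the mean observed yield scores $R^2 = 0$, which gives a simple baseline when comparing models.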

Abstract

While a variety of methods offer good yield prediction on histogrammed remote sensing data, vision Transformers are only sparsely represented in the literature. The Convolutional vision Transformer (CvT) is tested here to evaluate vision Transformers, which currently achieve state-of-the-art results in many other vision tasks. CvT combines some of the advantages of convolution with the dynamic attention and global context fusion of Transformers. It performs worse than widely tested methods such as XGBoost and CNNs, but shows that Transformers have the potential to improve yield prediction.

Paper Structure (10 sections, 4 tables)

Introduction
Related Work
Yield Prediction
Vision Transformer
Data
Methods
Results
Configurations of CvT
Comparison between XGBoost, CNN and CvT
Conclusion
