Label-free Anomaly Detection in Aerial Agricultural Images with Masked Image Modeling

Sambal Shikhar; Anupam Sobti

TL;DR

This work tackles the problem of detecting diverse, label-free anomalies in high-resolution agricultural UAV imagery. It leverages a Swin Transformer-based Masked Autoencoder (SwinMAE) to learn normal field patterns from unlabeled data and uses reconstruction errors as anomaly cues. A novel Anomaly Suppression Loss further stabilizes training by downweighting anomaly reconstructions, enabling a single model to generalize across multiple anomaly classes on the Agriculture-Vision dataset. The approach achieves state-of-the-art performance among unsupervised/self-supervised methods and offers a practical, annotation-efficient pipeline for early field stress identification in precision agriculture.

Abstract

Detecting various types of stress (nutritional, water, nitrogen, etc.) in agricultural fields is critical for farmers to ensure maximum productivity. However, stresses show up in different shapes and sizes across crop types and varieties, so we pose the problem as anomaly detection in agricultural images. Accurate anomaly detection in agricultural UAV images is vital for early identification of field irregularities. Traditional supervised learning struggles to adapt to diverse anomalies and requires extensive annotated data. In this work, we overcome this limitation with self-supervised learning using a masked image modeling approach. Masked Autoencoders (MAE) learn meaningful features of normal regions from unlabeled image samples, so abnormal pixels incur high reconstruction error. To remove the need to train only on "normal" data, we use an anomaly suppression loss that effectively minimizes the reconstruction of anomalous pixels, allowing the model to learn anomalous areas without explicitly separating "normal" images for training. Evaluation on the Agriculture-Vision data challenge shows an mIoU improvement over the prior state of the art in unsupervised and self-supervised methods. A single model generalizes across all the anomaly categories in the Agri-Vision Challenge Dataset.
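
The reconstruction-error idea in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the form of the anomaly suppression loss, so the weighting below (down-weighting pixels whose error exceeds a quantile) is purely an assumption for illustration, and the names `reconstruction_error_map`, `suppressed_loss`, `quantile`, and `suppress_weight` are hypothetical rather than taken from the paper.

```python
import numpy as np


def reconstruction_error_map(image, reconstruction):
    """Per-pixel squared error averaged over bands; inputs have shape (H, W, B)."""
    return np.mean((image - reconstruction) ** 2, axis=-1)


def suppressed_loss(image, reconstruction, quantile=0.9, suppress_weight=0.1):
    """Weighted mean error that down-weights the highest-error pixels.

    Assumption (not from the paper): pixels whose error exceeds the given
    quantile are treated as likely anomalies and contribute less to the loss,
    so the model is not pushed to reconstruct them well.
    """
    err = reconstruction_error_map(image, reconstruction)
    cutoff = np.quantile(err, quantile)
    weights = np.where(err > cutoff, suppress_weight, 1.0)
    return float(np.sum(weights * err) / np.sum(weights))


# Toy usage: a flat "normal" field with one badly reconstructed 2x2 patch.
image = np.zeros((8, 8, 3))
reconstruction = np.zeros((8, 8, 3))
reconstruction[0:2, 0:2, :] = 1.0
plain = reconstruction_error_map(image, reconstruction).mean()
suppressed = suppressed_loss(image, reconstruction)
print(plain, suppressed)
```

In this toy case the suppressed loss is smaller than the plain mean error, because the high-error patch carries a reduced weight. The actual loss used by the authors may differ substantially.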

Paper Structure (14 sections, 21 equations, 6 figures, 2 tables)

Introduction
Related Work
Background
Masked Auto-encoder
Swin Transformers
SwinMAE (Swin Masked Auto-encoder)
Anomaly Suppression Loss
Proposed Method
Experimental Setup
Dataset
Evaluation Metrics
Implementation Details
Results
Conclusion and Future Work

Figures (6)

Figure 1: Comparison of anomaly datasets: The left column represents a variety of industrial and other hyperspectral anomaly detection (AD) datasets, including MV-Tec, ABU-Airport, and Cri Image Hyperion dataset of Viareggio. The right column displays the Agri-Vision Challenge Dataset, highlighting agricultural anomalies such as Weed Clusters, Water stress, and Nutrient Deficiency. This illustrates the complexity of agricultural anomalies, showcasing their large inter-class and intra-class variations and their occurrence at multiple scales, as opposed to more uniform and scale-consistent anomalies found in industrial datasets.
Figure 2: The input image is masked and the unmasked image patches are fed into the encoder, which embeds each of those patches. The decoder takes the embedded patches along with the masked patches to reconstruct the input image. A reconstruction error map is generated, which is then used to generate the final anomaly map.
Figure 3: Comparison between masking methods: (a) original image; (b) normal random masking method; (c) window masking method.
Figure 4: Architecture of the Swin Masked Autoencoder (Swin MAE) for anomaly detection. The encoder, leveraging Swin Transformer blocks, processes the input image through stages of patch partitioning, embedding, and window masking, followed by successive Transformer and merging layers to create high-dimensional token representations. The decoder employs a sequence of expanding, Transformer, and normalization layers before projecting back to the pixel space, resulting in the reconstructed image and its corresponding reconstruction error map.
Figure 5: Anomaly detection using the Swin Masked Auto-encoder. A UAV image of shape (H, W, B) (height, width, number of bands) is input to the Swin MAE encoder, where K window-masked images are fed into the rest of the encoder, which comprises Swin Transformer and Patch Merging layers. The Swin MAE decoder then produces K predicted images of shape (K, H, W, B). The predicted images are compared to the original input image to produce a reconstruction error map, which is thresholded using knee-point calculation to produce the final binary anomaly map delineating the detected anomalies within the image.
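
The thresholding step described in the Figure 5 caption can be sketched roughly as follows. The caption does not specify how the K predictions are combined or how the knee point is computed, so two assumptions are made here: errors are averaged over the K predictions and the bands, and the knee is taken as the point on the sorted error curve farthest below the chord joining its endpoints. The function names `knee_point_threshold` and `anomaly_map` are hypothetical.

```python
import numpy as np


def knee_point_threshold(error_map):
    """Return the error value at the knee of the sorted error curve.

    Assumption: the knee is the point with the largest gap between the
    normalized chord (y = x) and the normalized sorted-error curve.
    """
    values = np.sort(error_map.ravel())
    span = values[-1] - values[0]
    if span == 0:
        # Uniform error everywhere: nothing stands out as anomalous.
        return float(values[-1])
    x = np.linspace(0.0, 1.0, values.size)
    y = (values - values[0]) / span
    knee_index = int(np.argmax(x - y))
    return float(values[knee_index])


def anomaly_map(image, predictions):
    """Binary anomaly map from an (H, W, B) image and (K, H, W, B) predictions."""
    # Assumption: average squared error over the K predictions and the bands.
    err = np.mean((predictions - image[None]) ** 2, axis=(0, -1))
    return err > knee_point_threshold(err)


# Toy usage: small reconstruction noise everywhere, large error on a 2x2 patch.
rng = np.random.default_rng(0)
image = np.zeros((16, 16, 3))
predictions = rng.normal(0.0, 0.01, size=(4, 16, 16, 3))
predictions[:, 2:4, 2:4, :] += 1.0
mask = anomaly_map(image, predictions)
print(mask.sum())
```

The chord-distance rule is one common way to locate a knee; it works well when anomalies are sparse, because the sorted error curve stays nearly flat and then rises sharply. Other knee-detection rules would give different thresholds.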
