Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning
Dakota Hester, Vitor S. Martins, Lucas B. Ferreira, Thainara M. A. Lima
TL;DR
The paper tackles the data scarcity barrier in very high spatial resolution (VHSR) land cover mapping by using BYOL self-supervised pre-training on a large corpus of unlabeled NAIP color-infrared (CIR) imagery to pre-train a ResNet-101 encoder. This encoder is transferred to multiple semantic segmentation architectures and fine-tuned with only 1,000 labeled patches to produce a 1 m, 8-class land cover map of Mississippi, validated with 25,000 test points and generated statewide from an ensemble of model predictions. In both linear probing and end-to-end fine-tuning, the approach yields strong gains over ImageNet baselines. The final 1 m Mississippi product reaches a macro F1 of about 75.6% and an overall accuracy of about 87.1%, showing that label-efficient VHSR mapping via self-supervised learning is practical. The method offers a scalable blueprint for operational high-resolution land cover mapping and highlights the value of in-domain pre-training, model ensembles, and cross-validation when labeled data are scarce.
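The gap between overall accuracy and macro F1 reported above is typical of maps with imbalanced classes: overall accuracy is dominated by the most common classes, while macro F1 weights every class equally. The following is a minimal sketch of both metrics computed from a confusion matrix. The 3-class matrix and the function names are hypothetical and chosen only for illustration; they are not values or code from the paper.

```python
import numpy as np

def overall_accuracy(cm):
    """Fraction of all samples on the diagonal (correctly classified)."""
    return np.trace(cm) / cm.sum()

def macro_f1(cm):
    """Unweighted mean of per-class F1; rows = reference, columns = predicted."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as the class, but reference differs
    fn = cm.sum(axis=1) - tp   # reference is the class, but predicted differs
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); classes with no samples get F1 = 0.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return f1.mean()

# Hypothetical confusion matrix: one dominant class and two rare ones.
cm = np.array([
    [90, 5, 5],
    [2, 6, 2],
    [3, 2, 5],
])

oa = overall_accuracy(cm)
mf1 = macro_f1(cm)
print(f"overall accuracy = {oa:.3f}, macro F1 = {mf1:.3f}")
```

In this toy case the dominant class is mapped well and the two rare classes are not, so macro F1 falls well below overall accuracy, the same direction as the gap in the reported results.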
Abstract
Deep learning semantic segmentation methods have shown promising performance for very high spatial resolution (1 m) land cover classification, but the difficulty of collecting large volumes of representative training data is a significant barrier to widespread adoption of such models for meter-scale land cover mapping over large areas. In this study, we present a novel label-efficient approach for statewide 1 m land cover classification that uses self-supervised deep learning and only 1,000 annotated reference image patches. We used the "Bootstrap Your Own Latent" (BYOL) pre-training strategy with a large set of unlabeled color-infrared aerial images (377,921 patches of 256 × 256 pixels at 1 m resolution) to pre-train a ResNet-101 convolutional encoder. The learned encoder weights were then transferred into multiple deep semantic segmentation architectures (FCN, U-Net, Attention U-Net, DeepLabV3+, UPerNet, PAN), which were fine-tuned on very small training sets (250, 500, 750 patches) with cross-validation. The best result, 87.14% overall accuracy and 75.58% macro F1 score, was obtained with an ensemble of the best-performing U-Net models for comprehensive 1 m, 8-class land cover mapping covering more than 123 billion pixels over the state of Mississippi, USA. Detailed qualitative and quantitative analysis showed accurate mapping of open water and forested areas, while highlighting the difficulty of separating cropland, herbaceous, and barren land cover types. These results show that self-supervised learning is an effective strategy for reducing the need for large volumes of manually annotated data, directly addressing a major limitation of high spatial resolution land cover mapping at scale.
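For readers unfamiliar with BYOL, its two core ingredients can be sketched in a few lines: a loss that pulls the online network's prediction toward the target network's projection of a different augmented view, and a target network whose weights are an exponential moving average (EMA) of the online weights. The NumPy sketch below illustrates only these two pieces under stated assumptions. The function names, the toy arrays, and the decay value `tau=0.99` are illustrative and are not taken from the paper, which does not give its training configuration in this section.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def byol_loss(online_pred, target_proj):
    """Normalized MSE between prediction and target: 2 - 2 * cosine similarity.

    In real training the target branch is treated as a constant
    (stop-gradient), and the loss is symmetrized by swapping the two views.
    """
    p = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return (2.0 - 2.0 * (p * z).sum(axis=-1)).mean()

def ema_update(target_params, online_params, tau=0.99):
    """Move target weights slowly toward online weights (tau is illustrative)."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

# Toy embeddings for a batch of two patches (hypothetical values).
view_a = np.array([[1.0, 2.0], [3.0, 4.0]])
loss_same = byol_loss(view_a, view_a)       # aligned vectors, loss near 0
loss_opposite = byol_loss(view_a, -view_a)  # opposite vectors, loss near 4

target = {"w": np.zeros(2)}
online = {"w": np.ones(2)}
target = ema_update(target, online, tau=0.9)
```

Because the loss needs no labels and no negative pairs, this style of pre-training can use all available unlabeled patches, which is what makes it attractive when only a small annotated set exists for fine-tuning.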
