reBEN: Refined BigEarthNet Dataset for Remote Sensing Image Analysis
Kai Norman Clasen, Leonard Hackel, Tom Burgert, Gencer Sumbul, Begüm Demir, Volker Markl
TL;DR
The paper addresses reliability and quality problems in large-scale remote sensing benchmarks such as BigEarthNet by introducing reBEN, a refined dataset of Sentinel-1/2 patch pairs, each covering $1200 \mathrm{m} \times 1200 \mathrm{m}$. It reprocesses the Sentinel-2 data to level-2A with the atmospheric correction tool sen2cor v2.11, derives labels from the 2018 CORINE Land Cover map, and associates each patch with a pixel-level reference map, enabling both pixel- and scene-based learning. A geographical-based split reduces spatial correlation among the train, validation, and test sets, and supplementary software (rico-hdl) converts the data into DL-optimized formats; pre-trained weights are released for reproducibility. The resulting 549,488 patch pairs, together with open code and tools, aim to support more reliable DL research in remote sensing image analysis and faster model training.
Abstract
This paper presents refined BigEarthNet (reBEN), a large-scale, multi-modal remote sensing dataset constructed to support deep learning (DL) studies for remote sensing image analysis. The reBEN dataset consists of 549,488 pairs of Sentinel-1 and Sentinel-2 image patches. To construct reBEN, we start from the Sentinel-1 and Sentinel-2 tiles used to construct the BigEarthNet dataset and divide them into patches of size 1200 m × 1200 m. We apply atmospheric correction to the Sentinel-2 patches using the latest version of the sen2cor tool, resulting in higher-quality patches than those present in BigEarthNet. Each patch is then associated with a pixel-level reference map and scene-level multi-labels, which makes reBEN suitable for both pixel- and scene-based learning tasks. The labels are derived from the most recent CORINE Land Cover (CLC) map of 2018 using the same 19-class nomenclature as BigEarthNet. Using the most recent CLC map overcomes the label noise present in BigEarthNet. Furthermore, we introduce a new geographical-based split assignment algorithm that significantly reduces the spatial correlation among the train, validation, and test sets compared to the splits in BigEarthNet. This increases the reliability of the evaluation of DL models. To minimize DL model training time, we introduce software tools that convert the reBEN dataset into a DL-optimized data format. In our experiments, we show the potential of reBEN for multi-modal multi-label image classification problems by considering several state-of-the-art DL models. The pre-trained model weights, associated code, and complete dataset are available at https://bigearth.net.
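To illustrate how a pixel-level reference map relates to scene-level multi-labels, the following minimal sketch marks a class as present in a patch whenever at least one pixel of the reference map carries it. This is an assumption made for illustration: the abstract does not state the exact derivation rule. The function name, the integer class encoding (0 to 18), and the `ignore_index` value for unlabeled pixels are likewise hypothetical and not taken from the reBEN tooling.

```python
from typing import List, Sequence

NUM_CLASSES = 19  # size of the 19-class nomenclature mentioned in the abstract


def scene_labels_from_reference_map(
    ref_map: Sequence[Sequence[int]],
    num_classes: int = NUM_CLASSES,
    ignore_index: int = 255,  # hypothetical marker for unlabeled pixels
) -> List[int]:
    """Return a multi-hot vector with 1 for every class present in ref_map."""
    # Collect the distinct class indices, skipping unlabeled pixels.
    present = {c for row in ref_map for c in row if c != ignore_index}
    # Reject indices outside the assumed encoding instead of silently dropping them.
    for c in present:
        if not 0 <= c < num_classes:
            raise ValueError(f"class index {c} outside [0, {num_classes})")
    return [1 if c in present else 0 for c in range(num_classes)]


# Toy 4 x 4 reference map with three classes and one unlabeled pixel.
toy_map = [
    [0, 0, 3, 3],
    [0, 0, 3, 3],
    [7, 7, 255, 3],
    [7, 7, 7, 3],
]
labels = scene_labels_from_reference_map(toy_map)
present_classes = [i for i, flag in enumerate(labels) if flag]
```

Here `present_classes` holds the indices 0, 3, and 7, so a scene-based classifier would be trained on the multi-hot vector `labels`, while a pixel-based model would use `toy_map` directly.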
