Better, Not Just More: Data-Centric Machine Learning for Earth Observation

Ribana Roscher, Marc Rußwurm, Caroline Gevaert, Michael Kampffmeyer, Jefersson A. dos Santos, Maria Vakalopoulou, Ronny Hänsch, Stine Hansen, Keiller Nogueira, Jonathan Prexl, Devis Tuia

TL;DR

This paper argues for a data-centric shift in Earth Observation machine learning, contending that dataset quality and the full ML deployment cycle are critical for real-world impact beyond model-only improvements. It defines data-centric learning, catalogs the five geospatial data quality criteria, and surveys techniques across data creation, curation, training utilization, and evaluation. Through three validation studies on the DFC2020 land cover dataset, it demonstrates that targeted data-centric actions—such as relevance-weighting, confident-learning-based pruning, and slice discovery—can yield tangible gains while also exposing risks of negative transfer and limited gains in some cases. The work highlights gaps in standardized data-quality metrics and automated data-centric tooling, calling for broader evaluation, feedback mechanisms, and integrated advances that couple data with model considerations to improve robustness across diverse geographies and conditions.

Abstract

Recent developments and research in modern machine learning have led to substantial improvements in the geospatial field. Although numerous deep learning architectures and models have been proposed, the majority of them have been developed solely on benchmark datasets that lack strong real-world relevance. Furthermore, the performance of many methods has already saturated on these datasets. We argue that a shift from a model-centric view to a complementary data-centric perspective is necessary for further improvements in accuracy, generalization ability, and real impact on end-user applications. Furthermore, considering the entire machine learning cycle, from problem definition to model deployment with feedback, is crucial for building machine learning models that remain reliable in unforeseen situations. This work presents a definition as well as a precise categorization and overview of automated data-centric learning approaches for geospatial data. It highlights the complementary role of data-centric learning with respect to model-centric learning in the larger machine learning deployment cycle. We review papers across the entire geospatial field and categorize them into different groups. A set of representative experiments provides concrete implementation examples and steps for acting on geospatial data with data-centric machine learning approaches.

Paper Structure (16 sections, 6 figures, 2 tables)

This paper contains 16 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Steps of the machine learning cycle. Each step highlights one way to interact with data and its quality, and each step can employ multiple techniques to perform the interaction (see Section \ref{sec:steps}). The cycle involves problem definition, data creation and curation, model training and evaluation, and final deployment, which feeds back into a modified problem definition. Model-centric learning focuses primarily on model training and evaluation, while data-centric learning involves algorithms covering data curation, creation, specific training strategies, evaluation, and deployment feedback. The considered quality criteria are diversity and completeness, accuracy, consistency, unbiasedness, and relevance (see Section \ref{sec:terminology}).
  • Figure 2: Data-centric machine learning techniques and papers referenced in Section \ref{sec:steps}. Each machine learning step and each technique (rows) interacts in a specific way with the data and its quality (columns). The techniques used for our experiments are highlighted in bold. Large dots indicate which quality information is used (in the model training and evaluation steps) or which specific quality criterion is acted on (in the creation and curation steps).
  • Figure 3: Study 1. Analyzing geographic domain shift with KLIEP to derive a relevance-weighting scheme for model training.
  • Figure 4: Qualitative results, Experiment 2. (a) Potential label issues detected by Confident Learning in the DFC training set. The model's predicted class probabilities are shown as bar charts. (b) Histogram of the label quality of class "Urban/Built-up" with three example images.
  • Figure 5: Confusion matrices of the model described in Section \ref{sec:expsetup} (left: validation set, right: test set).
  • ...and 1 more figure
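
The caption of Figure 4 mentions label issues detected by Confident Learning from the model's predicted class probabilities. The core idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `find_label_issues_sketch` is hypothetical, the predicted probabilities are assumed to be out-of-sample (for example from cross-validation), every class is assumed to appear at least once in the labels, and the self-confidence score is used here as a stand-in for the "label quality" shown in the histogram.

```python
import numpy as np


def find_label_issues_sketch(labels, pred_probs):
    """Flag likely label errors with a simplified confident-learning rule.

    labels:     (n,) integer class labels as given in the dataset
    pred_probs: (n, k) out-of-sample predicted class probabilities
    Returns a boolean mask of suspected issues and a per-sample quality score.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: average confidence in class j over samples labeled j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )

    # A sample "confidently" belongs to every class whose threshold it reaches;
    # among those, keep the class with the highest predicted probability.
    above = pred_probs >= thresholds
    masked = np.where(above, pred_probs, -np.inf)
    confident_class = masked.argmax(axis=1)
    has_confident_class = above.any(axis=1)

    # Suspected issue: the confident class disagrees with the given label.
    issues = has_confident_class & (confident_class != labels)

    # Self-confidence: predicted probability of the given label (higher = cleaner).
    quality = pred_probs[np.arange(len(labels)), labels]
    return issues, quality
```

Samples flagged by the mask are candidates for manual review or pruning; sorting by the quality score gives a ranking from most to least suspicious, which is a common way to prioritize relabeling effort.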