Kriging and Gaussian Process Interpolation for Georeferenced Data Augmentation
Frédérick Fabre Ferber, Dominique Gay, Jean-Christophe Soulié, Jean Diatta, Odalric-Ambrym Maillard
TL;DR
The paper tackles data augmentation for geo-referenced, data-scarce datasets by evaluating two families of interpolation methods, Gaussian processes (GPs) with multiple kernels and kriging with several variograms, to augment the observations used to predict weed cover (Commelina benghalensis L.) on sugarcane plots in La Réunion. It systematically compares predictive performance across multiple regression algorithms, analyzes how performance changes as augmented points are added, and assesses the spatial consistency of the augmented data via density maps. The results show that GP-based augmentation, notably with combined kernels (GP-COMB), generally delivers the strongest predictive gains and reaches them with fewer added points, while kriging provides more homogeneous spatial coverage. These findings support applying GP-based geo-referenced augmentation to similar spatially structured, limited-data problems and point to future work on multi-label extensions and broader geographic datasets.
Abstract
Data augmentation is a crucial step in the development of robust supervised learning models, especially when dealing with limited datasets. This study explores interpolation techniques for the augmentation of geo-referenced data, with the aim of predicting the presence of Commelina benghalensis L. in sugarcane plots in La Réunion. Given the spatial nature of the data and the high cost of data collection, we evaluated two interpolation approaches: Gaussian processes (GPs) with different kernels and kriging with various variograms. The objectives of this work are threefold: (i) to identify which interpolation methods offer the best predictive performance for various regression algorithms, (ii) to analyze the evolution of performance as a function of the number of observations added, and (iii) to assess the spatial consistency of augmented datasets. The results show that GP-based methods, in particular with combined kernels (GP-COMB), significantly improve the performance of regression algorithms while requiring less additional data. Although kriging shows slightly lower performance, it is distinguished by a more homogeneous spatial coverage, a potential advantage in certain contexts.
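To make the augmentation idea concrete, the following is a minimal sketch of GP-based interpolation with a combined kernel, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the choice of a squared-exponential plus exponential kernel as the "combined" kernel, the fixed length scales, the noise level, the uniform sampling of new locations inside the bounding box of the observations, and all function names (`combined_kernel`, `gp_posterior_mean`, `augment`) are hypothetical. The idea is to fit the GP on the observed (coordinate, value) pairs and use its posterior mean at new locations as synthetic observations.

```python
import math
import random

def rbf_kernel(a, b, length=1.0):
    """Squared-exponential kernel between two 2-D points."""
    d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return math.exp(-d2 / (2.0 * length ** 2))

def exponential_kernel(a, b, length=1.0):
    """Exponential (Matern 1/2) kernel between two 2-D points."""
    d = math.hypot(a[0] - b[0], a[1] - b[1])
    return math.exp(-d / length)

def combined_kernel(a, b):
    """Assumed 'combined' kernel: a sum of two valid kernels is a valid kernel."""
    return rbf_kernel(a, b) + exponential_kernel(a, b)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def gp_posterior_mean(X, y, X_new, kernel, noise=1e-6):
    """GP posterior mean at X_new, using a constant prior mean equal to mean(y)."""
    mean_y = sum(y) / len(y)
    K = [[kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    alpha = solve(K, [v - mean_y for v in y])
    return [mean_y + sum(kernel(x, xi) * ai for xi, ai in zip(X, alpha))
            for x in X_new]

def augment(X, y, n_new, kernel, seed=0):
    """Append n_new synthetic points drawn uniformly in the bounding box of X."""
    rng = random.Random(seed)
    xs = [p[0] for p in X]
    ys = [p[1] for p in X]
    X_new = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
             for _ in range(n_new)]
    y_new = gp_posterior_mean(X, y, X_new, kernel)
    return X + X_new, y + y_new

if __name__ == "__main__":
    # Hypothetical toy data: plot coordinates and a made-up target value.
    X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
    y = [10.0, 20.0, 15.0, 30.0, 18.0]
    X_aug, y_aug = augment(X, y, n_new=3, kernel=combined_kernel)
    for point, value in zip(X_aug[len(X):], y_aug[len(y):]):
        print(point, round(value, 2))
```

In practice one would fit the kernel hyperparameters to the data rather than fix them, for instance with a library such as scikit-learn; the sketch keeps them fixed only to stay dependency-free. A kriging variant would follow the same pattern, with the kernel replaced by a covariance derived from a fitted variogram model.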
