On minimizing the training set fill distance in machine learning regression
Paolo Climaco, Jochen Garcke
TL;DR
The paper addresses regression with costly labeling. It shows that selecting the training set with Farthest Point Sampling (FPS), which aims to minimize the training-set fill distance, yields an upper bound on the maximum prediction error that scales linearly with the fill distance $h_{\mathcal{L}_{\mathcal{X}},\mathcal{D}_{\mathcal{X}}}$. The theoretical framework is model-agnostic and is verified empirically on QM7, QM8, QM9, and rMD17, where FPS substantially reduces the maximum error of several regression models, particularly at low data budgets. For Gaussian-kernel regression, the paper further shows that FPS can improve numerical stability: larger separation distances between the selected points increase the smallest eigenvalue of the kernel matrix. The findings position FPS as a data-efficient, robust sampling strategy with practical impact in domains such as molecular property prediction. The paper also acknowledges two limits: the benefits depend on how strongly distances in feature space correlate with distances in label space, and improvements in average error are not guaranteed.
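For readers unfamiliar with the two geometric quantities named above, their standard definitions can be written as follows. This is a sketch under assumptions: the Euclidean norm is assumed here, $\mathcal{L}_{\mathcal{X}}$ is taken to be the feature locations of the selected training points and $\mathcal{D}_{\mathcal{X}}$ those of the full pool, and the symbol $q_{\mathcal{L}_{\mathcal{X}}}$ for the separation distance is illustrative notation that may differ from the paper's.

$$
h_{\mathcal{L}_{\mathcal{X}},\mathcal{D}_{\mathcal{X}}} = \max_{x \in \mathcal{D}_{\mathcal{X}}} \; \min_{x_j \in \mathcal{L}_{\mathcal{X}}} \lVert x - x_j \rVert_2,
\qquad
q_{\mathcal{L}_{\mathcal{X}}} = \frac{1}{2} \min_{\substack{x_i, x_j \in \mathcal{L}_{\mathcal{X}} \\ x_i \neq x_j}} \lVert x_i - x_j \rVert_2 .
$$

A small fill distance means every pool point has a selected point nearby; a large separation distance means no two selected points are nearly coincident, which is what keeps a Gaussian kernel matrix away from singularity.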
Abstract
For regression tasks, one often leverages large datasets for training predictive machine learning models. However, using large datasets may not be feasible due to computational limitations or high data labelling costs. Therefore, suitably selecting small training sets from large pools of unlabelled data points is essential to maximize model performance while maintaining efficiency. In this work, we study Farthest Point Sampling (FPS), a data selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error, conditional on the location of the unlabelled data points, that depends linearly on the training set fill distance. For empirical validation, we perform experiments using two regression models on three datasets. We empirically show that selecting a training set by aiming to minimize the fill distance, thereby minimizing our derived bound, significantly reduces the maximum prediction error of various regression models, outperforming alternative sampling approaches by a large margin. Furthermore, we show that selecting training sets with FPS can also increase model stability for the specific case of Gaussian kernel regression approaches.
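The selection procedure described in the abstract can be illustrated with a minimal sketch of greedy farthest point sampling and the fill distance it targets. This is not the authors' implementation: the function names `farthest_point_sampling` and `fill_distance`, the Euclidean metric, the fixed starting index, and the synthetic 2-D pool are all assumptions made for illustration.

```python
import numpy as np

def farthest_point_sampling(X, k, first=0):
    """Greedy FPS (illustrative): return indices of k rows of X.

    Starts from index `first` (an arbitrary choice) and repeatedly adds
    the pool point farthest from the already selected set.
    """
    selected = [first]
    # Distance from every pool point to its nearest selected point.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

def fill_distance(X, selected):
    """Largest distance from any pool point to its nearest selected point."""
    diffs = X[:, None, :] - X[selected][None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1).max()

# Synthetic pool of unlabelled points (assumption: 2-D Gaussian features).
rng = np.random.default_rng(0)
pool = rng.normal(size=(500, 2))

fps_idx = farthest_point_sampling(pool, 20)
rand_idx = rng.choice(len(pool), size=20, replace=False)

h_fps = fill_distance(pool, fps_idx)
h_rand = fill_distance(pool, rand_idx)
print(f"fill distance, FPS:    {h_fps:.3f}")
print(f"fill distance, random: {h_rand:.3f}")
```

Greedy FPS does not return the exact minimizer of the fill distance, but it is known to come within a factor of 2 of the optimum for a given budget, which is why it is described as "aiming to" minimize it. Each iteration costs one pass over the pool, so selecting k points from n costs on the order of n·k distance evaluations.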
