On minimizing the training set fill distance in machine learning regression

Paolo Climaco, Jochen Garcke

TL;DR

The paper addresses regression with costly labeling by showing that minimizing the training-set fill distance via Farthest Point Sampling (FPS) yields a bound on the maximum prediction error that scales linearly with the fill distance $h_{\mathcal{L}_{\mathcal{X}},\mathcal{D}_{\mathcal{X}}}$. It develops a model-agnostic theoretical framework and verifies it empirically on QM7, QM8, QM9, and rMD17, demonstrating substantial reductions in the maximum error for several regression models, particularly at low data budgets. It further shows that for Gaussian-kernel regression, FPS can improve numerical stability by increasing the kernel matrix minimum eigenvalue through larger separation distances. The findings underline FPS as a data-efficient, robust sampling strategy with practical impact in domains like molecular property prediction, while acknowledging that benefits depend on data correlations between feature and label spaces and that improvements in average error are not guaranteed.
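The selection step described above can be sketched in a few lines. The following is a minimal illustration of greedy FPS and of the fill distance it aims to reduce, assuming Euclidean distances on a finite pool of feature vectors; the function names `farthest_point_sampling` and `fill_distance`, the `start` index, and the synthetic uniform pool are assumptions made for this example and are not taken from the paper's code.

```python
import numpy as np


def farthest_point_sampling(points, n_select, start=0):
    """Greedy FPS: repeatedly add the pool point farthest from the current selection."""
    points = np.asarray(points, dtype=float)
    selected = [start]  # the initial point is an arbitrary choice (assumption)
    # Distance from every pool point to its nearest selected point so far.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        new_dist = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected


def fill_distance(points, selected):
    """Largest distance from any pool point to its nearest selected point."""
    points = np.asarray(points, dtype=float)
    chosen = points[selected]
    # Pairwise distances, shape (n_points, n_selected).
    dists = np.linalg.norm(points[:, None, :] - chosen[None, :, :], axis=2)
    return float(dists.min(axis=1).max())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = rng.uniform(size=(500, 2))  # synthetic pool, for illustration only
    fps_idx = farthest_point_sampling(pool, 25)
    rnd_idx = rng.choice(len(pool), size=25, replace=False)
    print("fill distance, FPS:   ", fill_distance(pool, fps_idx))
    print("fill distance, random:", fill_distance(pool, rnd_idx))
```

Greedy FPS does not compute the exact minimizer of the fill distance; it is a heuristic that keeps only one running array of nearest-selected distances, which makes each selection step linear in the pool size.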

Abstract

For regression tasks one often leverages large datasets for training predictive machine learning models. However, using large datasets may not be feasible due to computational limitations or high data labelling costs. Therefore, suitably selecting small training sets from large pools of unlabelled data points is essential to maximize model performance while maintaining efficiency. In this work, we study Farthest Point Sampling (FPS), a data selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error, conditional to the location of the unlabelled data points, that linearly depends on the training set fill distance. For empirical validation, we perform experiments using two regression models on three datasets. We empirically show that selecting a training set by aiming to minimize the fill distance, thereby minimizing our derived bound, significantly reduces the maximum prediction error of various regression models, outperforming alternative sampling approaches by a large margin. Furthermore, we show that selecting training sets with the FPS can also increase model stability for the specific case of Gaussian kernel regression approaches.
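The stability claim for Gaussian kernel regression can be illustrated numerically. The sketch below builds a Gaussian kernel matrix and reports its smallest eigenvalue and condition number for a clustered and a well-separated point set. It assumes the kernel form $\exp(-\lVert x - x' \rVert^2 / (2\sigma^2))$ and the convention that the separation distance is half the minimum pairwise distance; the names `sigma` and `reg` and the two toy point sets are hypothetical choices for this example, not values from the paper.

```python
import numpy as np


def gaussian_kernel_matrix(points, sigma=1.0):
    """Gaussian kernel matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    points = np.asarray(points, dtype=float)
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def separation_distance(points):
    """Half the smallest distance between two distinct points (assumed convention)."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # ignore zero self-distances
    return 0.5 * float(dists.min())


def kernel_spectrum_summary(points, sigma=1.0, reg=0.0):
    """Return (smallest eigenvalue, condition number) of K + reg * I."""
    K = gaussian_kernel_matrix(points, sigma) + reg * np.eye(len(points))
    eigvals = np.linalg.eigvalsh(K)  # ascending order for symmetric matrices
    return float(eigvals[0]), float(eigvals[-1] / eigvals[0])


if __name__ == "__main__":
    clustered = [[0.0], [0.01], [1.0]]  # two nearly coincident points
    spread = [[0.0], [0.5], [1.0]]      # evenly separated points
    for name, pts in [("clustered", clustered), ("spread", spread)]:
        lam_min, cond = kernel_spectrum_summary(pts)
        print(name, separation_distance(pts), lam_min, cond)
```

Nearly coincident training points produce almost identical kernel rows, which pushes the smallest eigenvalue toward zero. A selection that keeps points apart, as FPS tends to, avoids this on the toy example.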

Paper Structure (21 sections, 2 theorems, 34 equations, 10 figures, 1 algorithm)

This paper contains 21 sections, 2 theorems, 34 equations, 10 figures, 1 algorithm.

Key Result

Theorem 4

Given $\mathcal{D} := \{({\boldsymbol{x}}_q, y_q)\}_{q=1}^k = \mathcal{U} \sqcup \mathcal{L}$, a set of independent realizations of the random variables $({\boldsymbol{X}},Y)$ taking values in $\mathcal{Z}:= \mathcal{X} \times \mathcal{Y}$ with joint probability measure $p_{\mathcal{Z}}$, trained model … where $h_{\mathcal{L}_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}}}$ is the fill distance of …
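The theorem statement is cut off on this page, so the quantity it depends on is written out here for reference. The formula below is the standard definition of the fill distance of the labelled feature set $\mathcal{L}_{\mathcal{X}}$ in the full feature set $\mathcal{D}_{\mathcal{X}}$, assuming the Euclidean norm; it is a reminder of the usual definition, not a quotation of the paper's Definition 1.

```latex
h_{\mathcal{L}_{\mathcal{X}},\mathcal{D}_{\mathcal{X}}}
  := \max_{\boldsymbol{x} \in \mathcal{D}_{\mathcal{X}}}
     \min_{\boldsymbol{x}_j \in \mathcal{L}_{\mathcal{X}}}
     \lVert \boldsymbol{x} - \boldsymbol{x}_j \rVert_2
```

In words, it is the largest distance from any point in the dataset to its nearest labelled point, so a small value means every point lies close to some training sample.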

Figures (10)

  • Figure 1: Results for regression tasks on the illustrative example with a linear regression model trained on sets of various sizes selected randomly and with FPS. The maximum absolute error (MAXAE) of the trained linear regression model and the theoretical bound (TB) for the expected maximum error of the linear model, computed as in (\ref{bound}), are shown for each training set size. The amount of data used for training is expressed as a percentage of the available data points.
  • Figure 2: Results for regression tasks on QM7, QM8 and QM9 using KRR trained on sets of various sizes, expressed as a percentage of the available data points, and selected with different sampling strategies. MAXAE (top row) and MAE (bottom row) are shown for each training set size and sampling approach.
  • Figure 3: Results for regression tasks on QM7, QM8 and QM9 using FNN trained on sets of various sizes, expressed as a percentage of the available data points, and selected with different sampling strategies. MAXAE (top row) and MAE (bottom row) of the predictions are shown for each training set size and sampling approach.
  • Figure 4: Condition numbers of the regularized (top row) and non-regularized (bottom row) Gaussian kernels are shown for each dataset, training set size and sampling approach. The graphs are on a log-log scale and the error bands represent the confidence interval over five independent runs of the experiments.
  • Figure 5: (a) Fill distances of the selected training sets. (b) Euclidean distances to the nearest neighbour and (c) density of such distances for molecules in QM7 (top row), QM8 (middle row) and QM9 (bottom row). In (b) the red lines are the average distances between the molecules in the datasets and their nearest neighbour and the molecules are sequentially numbered such that the distances decrease in magnitude as the associated molecule numbers increase.
  • ...and 5 more figures

Theorems & Definitions (5)

  • Definition 1
  • Theorem 4
  • Remark 5
  • Lemma 6
  • Definition 7