Density-Aware Farthest Point Sampling

Paolo Climaco, Jochen Garcke

TL;DR

This work tackles data-efficient regression under label scarcity by proposing DA-FPS, a passive, model-agnostic sampling method guided by a data-driven bound that is linear in a weighted fill distance between training and data-source marginals. DA-FPS estimates data densities with adaptive $k$-NN methods and greedily selects points to minimize an estimated weighted fill distance, thereby promoting coverage of sparse regions while preserving representation of dense regions. Theoretical results give a $2k$-approximation guarantee for the estimated objective, and extensive experiments on molecular-property datasets show DA-FPS consistently improves mean absolute error and robustness (vs. baselines like FPS and random sampling) across regression models, especially at larger training budgets. The approach advances data-driven coreset design by incorporating distributional alignment into a model-agnostic sampling objective, with practical impact for chemistry and other domains where labeling is costly.

Abstract

We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. Thus, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that linearly depends on the weighted fill distance of the training set: a quantity we can estimate simply by considering the data features. We introduce "Density-Aware Farthest Point Sampling" (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimation of the weighted fill distance, thereby aiming at minimizing our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
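To make the notions of fill distance and greedy farthest-point selection more concrete, the following is a minimal Python sketch. It is not the authors' DA-FPS algorithm, whose exact weighting scheme is not given in this summary. The function names (`fill_distance`, `knn_density_weights`, `weighted_fps`) and the particular weight (inverse distance to the k-th nearest neighbour, used here as a crude density proxy) are assumptions made purely for illustration. With uniform weights the selection reduces to standard FPS.

```python
import numpy as np


def fill_distance(X, selected_idx, weights=None):
    """Largest (optionally weighted) distance from any point in X to its
    nearest selected point."""
    diffs = X[:, None, :] - X[selected_idx][None, :, :]
    nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
    if weights is not None:
        nearest = nearest * weights
    return float(nearest.max())


def knn_density_weights(X, k):
    """Hypothetical density proxy (an assumption, not the paper's estimator):
    inverse distance to the k-th nearest neighbour, scaled so the max is 1."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    kth = np.sort(D, axis=1)[:, k]  # column 0 is the point itself
    w = 1.0 / (kth + 1e-12)
    return w / w.max()


def weighted_fps(X, budget, weights=None, start=0):
    """Greedy selection: repeatedly add the point with the largest weighted
    distance to the current selection. Uniform weights give plain FPS."""
    n = X.shape[0]
    if weights is None:
        weights = np.ones(n)
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = np.linalg.norm(X - X[start], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(weights * min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

A typical call would be `weighted_fps(X, budget=100, weights=knn_density_weights(X, k=10))`, where `X` is an array of feature vectors with one row per point. The pairwise-distance matrices are quadratic in the number of points, so this sketch is only suitable for small datasets.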

Paper Structure

This paper contains 30 sections, 5 theorems, 52 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Theorem 4.2

Consider random variables $({\boldsymbol{X}}, Y)$ taking value on $\mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \mathbb{R}$, with $\mathcal{X}$ bounded, data source distribution $p_{\mathcal{D}} \in \mathcal{P}$, labeled dataset $\mathcal{L} := \{({\boldsymbol{x}}_j, y_j)\}_{j=1}^b$ ar where $\mathbb{P}_{ \mathcal{L}}\left[p_{\mathcal{X}_{\mathcal{D}}} <p_{\mathcal{X}_{\mathcal{L}}}\

Figures (9)

  • Figure 1: Illustration of DA-FPS, FPS, and uniform random sampling on a synthetic 2D dataset. Top row: (a) A dataset with 1000 points in the unit square. The dataset consists of a high-density central cluster (650 points), a smaller lower-left cluster (200 points), and uniformly scattered points (150 points). The background shows a 2D kernel density estimation, where darker blue indicates higher density. Bottom row: 100 points selected by each method. (b) Uniform random sampling mostly selects from the dense cluster and may miss sparse regions. (c) FPS selects points evenly across the space, ignoring density. (d) DA-FPS selects more points from dense regions but still covers sparse areas, balancing density and coverage.
  • Figure 2: MAE for regression tasks on QM datasets using KRR with Gaussian kernel (top row) and FNN (bottom row) trained on sets of various sizes, expressed as a percentage of the available data points, and selected with different sampling strategies. Error bars represent the standard deviation over five runs. DA-FPS (red lines) outperforms the baselines. The legend in the top-row leftmost graph applies to all graphs. DA-FPS is initialized with $\mathcal{L}_{\mathcal{X}}= \emptyset$, $k=100$, and $u= 3\%$ of the available data, independently of the dataset.
  • Figure 3: RMSE for regression tasks on QM datasets using KRR with Gaussian kernel (top row) and FNN (bottom row) trained on sets of various sizes, expressed as a percentage of the available data points, and selected with different sampling strategies. Error bars represent the standard deviation over five runs. In terms of RMSE, the performance of FPS (blue lines) can be close to that of DA-FPS (red lines), e.g., on QM7 and, for larger training set sizes, on QM8. Nevertheless, DA-FPS still yields the most competitive performance across datasets. The legend in the leftmost graph in the top row applies to all graphs. DA-FPS is initialized with $\mathcal{L}_{\mathcal{X}}= \emptyset$, $k=100$, and $u= 3\%$ of the available data, independently of the dataset.
  • Figure 4: Results for regression tasks on the Concrete Compressive Strength, Electrical Grid Stability, and QM8 datasets. We use KRR with the Cauchy kernel trained on sets of various sizes, expressed as a percentage of the available data points, and selected with different sampling strategies. MAE (top row), RMSE (middle row), and MAXAE (bottom row) are shown for each training set size and sampling approach. Error bars represent the standard deviation of the results over five runs. DA-FPS (red lines) is consistently competitive across all metrics. For MAE, DA-FPS generally outperforms the other methods, except Twinning (black lines) at the 5% training set size on QM8; Twinning is the second-best method in terms of MAE. For RMSE, DA-FPS consistently ranks as the best or second-best method. The MAXAE results confirm DA-FPS as the best or second-best approach, with FPS (blue lines) as the other most competitive approach. Twinning, despite its strong MAE performance, underperforms in MAXAE and is sometimes worse than random sampling (green lines). DA-FPS is initialized with $\mathcal{L}_{\mathcal{X}}= \emptyset$, with $u =$ 3%, 1%, and 3% and $k =$ 100, 300, and 300 for the QM8, Concrete Compressive Strength, and Electrical Grid Stability datasets, respectively.
  • Figure 5: MAE for regression tasks on QM datasets using KRR with Gaussian kernel (top row) and FNN (bottom row) trained on sets of various sizes, expressed as a percentage of the available data points, and selected with different sampling strategies. Error bars represent the standard deviation over five runs. The modified versions of the baselines (dashed lines) lead to better performance than the respective original baselines (solid lines). The legend in the leftmost graph in the top row applies to all graphs. The modified baselines sample the first 3% of the points using FPS.
  • ...and 4 more figures
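
The synthetic dataset described in the caption of Figure 1 can be approximated with a short Python sketch. The point counts (650, 200, 150) and the unit-square domain come from the caption. The cluster centres, the standard deviations, the use of Gaussian clusters, and the function name `make_synthetic_dataset` are assumptions chosen only for illustration, since the caption does not specify them.

```python
import numpy as np


def make_synthetic_dataset(seed=0):
    """Roughly mimic the Figure 1 dataset: 1000 points in the unit square.

    Cluster centres and spreads below are illustrative assumptions; only the
    point counts and the unit-square domain are taken from the caption.
    """
    rng = np.random.default_rng(seed)
    # High-density central cluster (650 points).
    central = rng.normal(loc=[0.5, 0.5], scale=0.05, size=(650, 2))
    # Smaller lower-left cluster (200 points).
    lower_left = rng.normal(loc=[0.2, 0.2], scale=0.04, size=(200, 2))
    # Uniformly scattered background points (150 points).
    scattered = rng.uniform(0.0, 1.0, size=(150, 2))
    X = np.vstack([central, lower_left, scattered])
    # Keep every point inside the unit square.
    return np.clip(X, 0.0, 1.0)


X = make_synthetic_dataset()
```

Drawing 100 indices uniformly at random from `X`, or passing `X` to a farthest-point routine, gives selections that can be compared in the way the bottom row of Figure 1 does.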

Theorems & Definitions (13)

  • Definition 4.1
  • Theorem 4.2
  • Theorem 7.1
  • Remark A.1
  • proof
  • proof
  • Theorem D.1
  • proof
  • Corollary D.2
  • proof
  • ...and 3 more