Density-Aware Farthest Point Sampling
Paolo Climaco, Jochen Garcke
TL;DR
This work tackles data-efficient regression under label scarcity by proposing DA-FPS, a passive, model-agnostic sampling method. The method is guided by a data-driven error bound that is linear in a weighted fill distance between the training-set and data-source marginals. DA-FPS estimates data densities with adaptive $k$-NN methods and greedily selects points to minimize an estimated weighted fill distance, which promotes coverage of sparse regions while preserving the representation of dense regions. Theoretical results give a $2k$-approximation guarantee for the estimated objective. Experiments on molecular-property datasets show that DA-FPS consistently improves mean absolute error and robustness relative to baselines such as FPS and random sampling across regression models, especially at larger training budgets. The approach advances data-driven coreset design by incorporating distributional alignment into a model-agnostic sampling objective, with practical impact for chemistry and other domains where labeling is costly.
Abstract
We focus on training machine learning regression models in scenarios where labeled training data is limited due to computational constraints or high labeling costs. In such scenarios, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that depends linearly on the weighted fill distance of the training set: a quantity we can estimate from the data features alone. We introduce "Density-Aware Farthest Point Sampling" (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimate of the weighted fill distance, thereby aiming to minimize our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
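To make the idea of greedily reducing a weighted fill distance concrete, the following is a minimal sketch of a generic weighted farthest point sampling loop. It is not the DA-FPS algorithm itself: the weight function `knn_density_weights` (inverse mean distance to the `k` nearest neighbours), the Euclidean metric, and all function names are assumptions chosen for illustration, and the abstract does not specify how the weights are actually defined.

```python
import numpy as np

def knn_density_weights(X, k=5):
    """Hypothetical density proxy: inverse mean distance to the k nearest
    neighbours, scaled so the largest weight is 1 (assumption, not from the paper)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1]   # column 0 is the zero self-distance
    w = 1.0 / (knn.mean(axis=1) + 1e-12)   # small constant guards against duplicates
    return w / w.max()

def weighted_fill_distance(X, selected, w):
    """Largest weighted distance from any point in X to its nearest selected point."""
    d = np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=-1)
    return float(np.max(w * d.min(axis=1)))

def weighted_fps(X, budget, w, start=0):
    """Greedy selection: repeatedly add the point whose weighted distance
    to the current selection is largest."""
    if not 1 <= budget <= len(X):
        raise ValueError("budget must be between 1 and the number of points")
    selected = [start]
    min_dist = np.linalg.norm(X - X[start], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(w * min_dist))
        selected.append(nxt)
        # Update each point's distance to its nearest selected point.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Usage on synthetic features (random data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w = knn_density_weights(X, k=5)
train_idx = weighted_fps(X, budget=20, w=w)
print(weighted_fill_distance(X, train_idx, w))
```

With all weights set to one, the loop reduces to standard farthest point sampling. Because each greedy step only adds a point, the weighted fill distance computed this way can never increase as the budget grows.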
