Predictive Coresets
Bernardo Flores
TL;DR
The paper addresses the scalability of Bayesian inference on massive datasets by replacing likelihood-based coreset weighting with weights chosen to match posterior predictive distributions. It introduces predictive coresets, built via a Dirichlet process (DP) based proxy for the posterior predictive and an optimal-transport–style transformation that maps full-data observations to a small weighted subset; the approach is model-agnostic and extends to nonparametric and non-Euclidean settings. The authors provide theoretical guarantees through posterior contraction rates in Wasserstein spaces and demonstrate the method on density estimation, logistic regression, and random partitions, with adaptive extensions to accelerate hyperparameter exploration. Together, predictive-distribution matching, optimal transport, and DP priors yield a flexible, scalable coreset construction with convergence guarantees for large-scale Bayesian analysis, including data in nontraditional spaces.
Abstract
Modern data analysis often involves massive datasets with hundreds of thousands of observations, making traditional inference algorithms computationally prohibitive. Coresets are selection methods designed to choose a smaller, weighted subset of observations while maintaining similar learning performance. Conventional coreset approaches determine the weights by minimizing the Kullback-Leibler (KL) divergence between the likelihood functions of the full and weighted datasets, which makes them ill-posed for nonparametric models, where the likelihood is often intractable. We propose an alternative variational method that employs randomized posteriors and finds weights to match the unknown posterior predictive distributions conditioned on the full and reduced datasets. Our approach provides a general algorithm based on predictive recursions suitable for nonparametric priors. We evaluate the performance of the proposed coreset construction on diverse problems, including random partitions and density estimation.
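To make the idea of matching predictive distributions concrete, the following is a minimal sketch and not the authors' algorithm. It assumes one-dimensional data and a Dirichlet process with concentration alpha, whose posterior predictive is the standard urn mixture of the base measure and the (weighted) observed atoms. The coreset rule used here (block medians with equal weights n/m) and the names wasserstein_1d, coreset, gap, and dp_gap are hypothetical choices made only for illustration.

```python
import random


def wasserstein_1d(xs, wx, ys, wy):
    """W1 distance between two weighted empirical distributions on the real line.

    Uses the identity W1 = integral of |F - G| over the line, where F and G
    are the two cumulative distribution functions.
    """
    sx, sy = sum(wx), sum(wy)
    px, py = {}, {}
    for x, w in zip(xs, wx):
        px[x] = px.get(x, 0.0) + w / sx  # normalized mass at each atom
    for y, w in zip(ys, wy):
        py[y] = py.get(y, 0.0) + w / sy
    points = sorted(set(px) | set(py))
    cdf_x = cdf_y = total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x += px.get(left, 0.0)
        cdf_y += py.get(left, 0.0)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


random.seed(0)
full = [random.gauss(0.0, 1.0) for _ in range(2000)]
n, m = len(full), 20

# Hypothetical coreset rule: the median of each of m equal-size blocks of the
# sorted data, each carrying weight n / m so the total weight stays n.
ordered = sorted(full)
coreset = [ordered[(2 * j + 1) * n // (2 * m)] for j in range(m)]
weights = [n / m] * m

# Distance between the full-data and coreset empirical distributions.
gap = wasserstein_1d(full, [1.0] * n, coreset, weights)

# Assumed DP predictive: (alpha * G0 + sum_i w_i * delta_{x_i}) / (alpha + sum_i w_i).
# With equal total weight the G0 parts cancel, so in one dimension the
# predictive gap is the atom gap scaled by n / (alpha + n).
alpha = 1.0
dp_gap = n / (alpha + n) * gap
print(f"W1 between atoms: {gap:.4f}; between DP predictives: {dp_gap:.4f}")
```

Here 20 weighted points stand in for 2000 observations, and the reported distance measures how much the predictive distribution moves as a result. A predictive coreset method would choose points and weights to make this kind of discrepancy small, rather than fixing them by a quantile rule as done here.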
