PCA-Guided Quantile Sampling: Preserving Data Structure in Large-Scale Subsampling
Foo Hui-Mean, Yuan-chin Ivan Chang
TL;DR
PCA-QS is a principled, interpretable subsampling approach that preserves both the statistical distribution and the geometric structure of large-scale data: the leading principal components guide the sampling, while the samples themselves stay in the original feature space. The method comes with theoretical guarantees, including uniform convergence of quantiles and fast decay of the KL divergence, complemented by a slower but geometry-sensitive convergence in Wasserstein distance due to reduced dimensionality. It provides concrete guidelines for selecting the number of principal components and quantile bins, and it consistently outperforms simple random sampling on synthetic and real-world datasets, improving downstream model performance. The framework scales through randomized or incremental PCA and efficient quantile computation, making it practical for modern ML pipelines. These contributions position PCA-QS as a robust, scalable, and theoretically grounded solution for efficient data summarization in large-scale learning tasks.
Abstract
We introduce Principal Component Analysis-guided Quantile Sampling (PCA-QS), a novel sampling framework designed to preserve both the statistical and geometric structure of large-scale datasets. Unlike conventional PCA, which reduces dimensionality at the cost of interpretability, PCA-QS retains the original feature space while using leading principal components solely to guide a quantile-based stratification scheme. This principled design ensures that sampling remains representative without distorting the underlying data semantics. We establish rigorous theoretical guarantees, deriving convergence rates for empirical quantiles, Kullback-Leibler divergence, and Wasserstein distance, thus quantifying the distributional fidelity of PCA-QS samples. Practical guidelines for selecting the number of principal components, quantile bins, and sampling rates are provided based on these results. Extensive empirical studies on both synthetic and real-world datasets show that PCA-QS consistently outperforms simple random sampling, yielding better structure preservation and improved downstream model performance. Together, these contributions position PCA-QS as a scalable, interpretable, and theoretically grounded solution for efficient data summarization in modern machine learning workflows.
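To make the idea of PC-guided quantile stratification concrete, the following is a minimal sketch and not the authors' implementation. It assumes that strata are formed by crossing equal-probability quantile bins of the leading principal component scores, and that each stratum is sampled uniformly at a fixed rate; the abstract does not specify these details. The function name `pca_qs_sample` and its parameters (`n_components`, `n_bins`, `rate`, `seed`) are hypothetical, chosen for illustration only.

```python
import numpy as np


def pca_qs_sample(X, n_components=2, n_bins=4, rate=0.1, seed=0):
    """Illustrative sketch (assumed scheme, not the paper's exact algorithm).

    Returns row indices into X, so the subsample X[idx] keeps the
    original feature space; the PCs are used only to define strata.
    """
    rng = np.random.default_rng(seed)

    # Project centered data onto the leading principal directions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T  # shape (n_samples, n_components)

    # Assign each row to a stratum: one quantile bin per component,
    # combined into a single integer cell id.
    cell = np.zeros(X.shape[0], dtype=np.int64)
    inner_probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    for j in range(n_components):
        edges = np.quantile(scores[:, j], inner_probs)
        bins = np.searchsorted(edges, scores[:, j], side="right")
        cell = cell * n_bins + bins

    # Sample uniformly without replacement inside every non-empty stratum
    # (assumption: proportional allocation, at least one row per stratum).
    chosen = []
    for c in np.unique(cell):
        members = np.flatnonzero(cell == c)
        take = max(1, int(round(rate * members.size)))
        chosen.append(rng.choice(members, size=take, replace=False))

    return np.sort(np.concatenate(chosen))


if __name__ == "__main__":
    demo_rng = np.random.default_rng(42)
    X_demo = demo_rng.normal(size=(1000, 5))
    idx = pca_qs_sample(X_demo, n_components=2, n_bins=4, rate=0.1)
    print(X_demo[idx].shape)  # the subsample keeps all 5 original features
```

Returning indices rather than projected points reflects the abstract's statement that the original feature space is retained. For datasets too large for a full SVD, the projection step could be swapped for a randomized or incremental PCA, as the TL;DR suggests.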
