Using Low-Discrepancy Points for Data Compression in Machine Learning: An Experimental Comparison
Simone Göttlich, Jacob Heieck, Andreas Neuenkirch
TL;DR
This work investigates data reduction for regression and neural-network training using low-discrepancy (Quasi-Monte Carlo, QMC) points. It compares two QMC-based compression schemes, QMC-averaging and QMC-Voronoi, with the adaptive supercompress method, considering both the deterministic error bounds available for the QMC approaches and the empirical performance on synthetic test functions and MNIST. In the experiments, the adaptive clustering of the standard supercompress approach consistently outperforms the QMC methods on real-world, high-dimensional data; QMC-Voronoi is competitive on simple, regular problems but does not scale to MNIST. The findings suggest that for complex data, output-space-focused clustering with adaptive refinement gives the most reliable compression, maintaining predictive accuracy while reducing training cost, whereas the QMC-based guarantees are most beneficial in regular settings.
Abstract
Low-discrepancy points (also called Quasi-Monte Carlo points) are deterministically and cleverly chosen point sets in the unit cube that approximate the uniform distribution. We explore two methods based on such low-discrepancy points for reducing large data sets in order to train neural networks. The first is the method of Dick and Feischl [4], which relies on digital nets and an averaging procedure. Motivated by our experimental findings, we construct a second method, which again uses digital nets but replaces the averaging with Voronoi clustering. Both methods are compared to the supercompress approach of [14], which is a variant of the K-means clustering algorithm. The comparison is carried out in terms of the compression error for different objective functions and the accuracy of the training of a neural network.
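
To make the idea of compressing data with a digital net and Voronoi cells concrete, the following is a minimal sketch in pure Python. It is not the construction used in this work. As illustrative assumptions, it takes the two-dimensional Hammersley point set in base 2 as the digital net, assigns each data point to its nearest net point, and represents each non-empty cell by the mean response and the number of points it contains. The function names, the toy target, and the sample sizes are hypothetical.

```python
import random


def van_der_corput(i, base=2):
    """Radical inverse of the non-negative integer i in the given base."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        x += digit / denom
    return x


def hammersley_net(m):
    """Return the 2^m points (i / 2^m, radical inverse of i) in [0, 1)^2."""
    n = 2 ** m
    return [(i / n, van_der_corput(i)) for i in range(n)]


def voronoi_compress(xs, ys, centers):
    """Assign each x to its nearest center (squared Euclidean distance).

    Returns one (center, mean response, count) triple per non-empty cell.
    Assumption: representing a cell by its mean response is an
    illustrative choice, not necessarily the rule used in the paper.
    """
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x, y in zip(xs, ys):
        j = min(
            range(len(centers)),
            key=lambda k: (x[0] - centers[k][0]) ** 2 + (x[1] - centers[k][1]) ** 2,
        )
        sums[j] += y
        counts[j] += 1
    return [
        (centers[j], sums[j] / counts[j], counts[j])
        for j in range(len(centers))
        if counts[j] > 0
    ]


# Toy data (hypothetical): 2000 uniform inputs with response x1 + x2.
rng = random.Random(0)
xs = [(rng.random(), rng.random()) for _ in range(2000)]
ys = [x[0] + x[1] for x in xs]

centers = hammersley_net(4)  # 16 net points
compressed = voronoi_compress(xs, ys, centers)
```

In this sketch the 2000 samples are reduced to at most 16 weighted representatives, and the count-weighted mean of the compressed responses equals the mean of the original responses. A regression model could then be fitted to the compressed set with the counts as weights.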
