Sample Compression Unleashed: New Generalization Bounds for Real Valued Losses
Mathieu Bazinet, Valentina Zantedeschi, Pascal Germain
TL;DR
This work addresses the challenge of obtaining generalization guarantees for real-valued and unbounded losses within the sample-compression paradigm. It develops a general PAC-Bayes–inspired bound for real-valued losses using a comparator framework, yielding Catoni-type and KL-based bounds, plus a bound for unbounded sub-Gaussian losses, all independent of model size. Central to the approach is Pick-To-Learn (P2L), a model-agnostic meta-algorithm that turns the training procedure of any predictor into one that outputs a sample-compressed predictor, by incrementally building a compression set and retraining on it. This enables tight, data-efficient generalization certificates for deep nets, random forests, and NLP models such as DistilBERT. Empirical results on Binary MNIST, MNIST, regression with trees, and Amazon polarity show non-vacuous, tight bounds that scale with compression size rather than parameter count, highlighting the practical value of certificate-based generalization in real-valued loss settings.
Abstract
Sample compression theory provides generalization guarantees for predictors that can be fully defined using a subset of the training dataset and a (short) message string, generally defined as a binary sequence. Previous works provided generalization bounds for the zero-one loss, which is restrictive, notably when applied to deep learning approaches. In this paper, we present a general framework for deriving new sample compression bounds that hold for real-valued unbounded losses. Using the Pick-To-Learn (P2L) meta-algorithm, which transforms the training method of any machine-learning predictor to yield sample-compressed predictors, we empirically demonstrate the tightness of the bounds and their versatility by evaluating them on random forests and multiple types of neural networks.
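
To make the idea of a sample-compressed predictor concrete, the following is a minimal sketch of a P2L-style loop. It is an illustration under stated assumptions, not the authors' implementation. The text above only says that P2L incrementally builds a compression set and retrains. The sketch further assumes that each round adds the remaining example with the highest loss, that the predictor is retrained from scratch on the compression set, and that the loop stops once every remaining example has loss at or below a threshold. The base learner `train_centroid`, the loss `zero_one`, and the function `pick_to_learn` are hypothetical names chosen for this example.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]          # (1-D feature, label in {0, 1})
Predictor = Callable[[float], int]


def train_centroid(examples: Sequence[Example]) -> Predictor:
    """Hypothetical base learner: a 1-D nearest-centroid classifier.

    With no example of class 1 (including the empty set) it always predicts 0;
    with no example of class 0 it always predicts 1.
    """
    by_class = {0: [], 1: []}
    for x, y in examples:
        by_class[y].append(x)
    if not by_class[1]:
        return lambda x: 0
    if not by_class[0]:
        return lambda x: 1
    c0 = sum(by_class[0]) / len(by_class[0])
    c1 = sum(by_class[1]) / len(by_class[1])
    return lambda x: 0 if abs(x - c0) <= abs(x - c1) else 1


def zero_one(predictor: Predictor, example: Example) -> float:
    """Loss used to rank examples; any real-valued loss could be swapped in."""
    x, y = example
    return float(predictor(x) != y)


def pick_to_learn(
    data: Sequence[Example],
    train: Callable[[Sequence[Example]], Predictor],
    loss: Callable[[Predictor, Example], float],
    threshold: float = 0.0,
) -> Tuple[Predictor, List[int]]:
    """Assumed P2L-style loop: grow the compression set one example at a time."""
    compression: List[int] = []              # indices kept in the compression set
    remaining = list(range(len(data)))       # indices not yet selected
    predictor = train([])                    # start from the empty compression set
    while remaining:
        # Find the remaining example on which the current predictor does worst.
        worst = max(remaining, key=lambda i: loss(predictor, data[i]))
        if loss(predictor, data[worst]) <= threshold:
            break                            # all remaining examples are handled
        compression.append(worst)
        remaining.remove(worst)
        # Retrain using only the compression set (assumption: from scratch).
        predictor = train([data[i] for i in compression])
    return predictor, compression


# Toy usage on six hand-made points (illustrative data, not from the paper).
data = [(0.0, 0), (0.2, 0), (0.4, 0), (1.6, 1), (1.8, 1), (2.0, 1)]
predictor, compression = pick_to_learn(data, train_centroid, zero_one)
print("compression set indices:", compression)
print("compression set size:", len(compression), "out of", len(data))
```

The returned predictor depends only on the examples indexed by `compression`, which is why a sample-compression bound can scale with the size of that set rather than with the number of model parameters.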
