Prediction-Oriented Subsampling from Data Streams
Benedetta Lavinia Mussati, Freddie Bickford Smith, Tom Rainforth, Stephen Roberts
TL;DR
This paper tackles subsampling from data streams for efficient offline learning, keeping the data that most improves downstream predictions. It introduces expected predictive information gain (EPIG) and a label-aware variant (LA-EPIG) as principled data-utility criteria, and estimates them in practice via likelihood reweighting. Empirical results on Split MNIST and Split CIFAR-10 show that prediction-focused subsampling can outperform a previously proposed information-theoretic baseline, though success hinges on careful model architecture and uncertainty estimation. The work highlights the need for scalable subsampling algorithms and for better practices in building models that give reliable uncertainty estimates in data-stream settings.
Abstract
Data is often generated in streams, with new observations arriving over time. A key challenge for learning models from data streams is capturing relevant information while keeping computational costs manageable. We explore intelligent data subsampling for offline learning, and argue for an information-theoretic method centred on reducing uncertainty in downstream predictions of interest. Empirically, we demonstrate that this prediction-oriented approach performs better than a previously proposed information-theoretic technique on two widely studied problems. At the same time, we highlight that reliably achieving strong performance in practice requires careful model design.
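As a rough illustration of the prediction-oriented criterion, the sketch below scores candidate inputs by a Monte Carlo estimate of EPIG for a classifier, using class probabilities from several sampled models (for example, ensemble members). It is a minimal NumPy sketch under stated assumptions, not the paper's estimator: the function name `epig_scores`, the array layout, and the `eps` smoothing constant are all hypothetical, and neither the label-aware variant nor the likelihood reweighting mentioned in the TL;DR is covered.

```python
import numpy as np


def epig_scores(probs_pool, probs_targ, eps=1e-12):
    """Estimate EPIG for each candidate input (hypothetical helper).

    probs_pool: array (K, N, C), class probabilities for N candidate inputs
                under K sampled models.
    probs_targ: array (K, M, C), class probabilities for M target inputs
                (inputs we expect to predict on) under the same K models.
    Returns an array (N,): mutual information between the candidate label
    and the target label, averaged over the M target inputs.
    """
    num_models = probs_pool.shape[0]

    # Joint predictive p(y, y* | x, x*), averaged over sampled models:
    # shape (N, M, C, C).
    joint = np.einsum("knc,kmd->nmcd", probs_pool, probs_targ) / num_models

    # Marginal predictives p(y | x) and p(y* | x*).
    marg_pool = probs_pool.mean(axis=0)  # (N, C)
    marg_targ = probs_targ.mean(axis=0)  # (M, C)

    # Product of marginals, broadcast to the joint's shape (N, M, C, C).
    indep = marg_pool[:, None, :, None] * marg_targ[None, :, None, :]

    # KL divergence between joint and product of marginals, per (x, x*) pair.
    # eps (an assumed constant) guards against log(0).
    mutual_info = np.sum(
        joint * (np.log(joint + eps) - np.log(indep + eps)), axis=(2, 3)
    )  # (N, M)

    # Average over target inputs to get one score per candidate.
    return mutual_info.mean(axis=1)
```

Under this sketch, a subsample of a hypothetical size `budget` could be chosen by keeping the highest-scoring candidates, for example `np.argsort(-epig_scores(probs_pool, probs_targ))[:budget]`. A candidate on which all sampled models agree scores near zero, since observing its label would not change predictions on the targets.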
