Prediction-Oriented Subsampling from Data Streams

Benedetta Lavinia Mussati, Freddie Bickford Smith, Tom Rainforth, Stephen Roberts

TL;DR

This paper tackles subsampling from data streams to enable efficient offline learning by focusing on information that improves downstream predictions. It applies the expected predictive information gain (EPIG) and a label-aware variant (LA-EPIG) as principled data-utility criteria, and provides practical estimation via likelihood reweighting to maximize predictive information gain. Empirical results on Split MNIST and Split CIFAR-10 show that prediction-focused subsampling can outperform a previously proposed information-theoretic technique, though success hinges on careful model architecture and uncertainty estimation. The work highlights the need for scalable subsampling algorithms and better practices for building models that support reliable uncertainty estimates in data-stream settings.

Abstract

Data is often generated in streams, with new observations arriving over time. A key challenge for learning models from data streams is capturing relevant information while keeping computational costs manageable. We explore intelligent data subsampling for offline learning, and argue for an information-theoretic method centred on reducing uncertainty in downstream predictions of interest. Empirically, we demonstrate that this prediction-oriented approach performs better than a previously proposed information-theoretic technique on two widely studied problems. At the same time, we highlight that reliably achieving strong performance in practice requires careful model design.
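
To make the idea of "reducing uncertainty in downstream predictions of interest" concrete, the following is a minimal sketch of a Monte Carlo EPIG estimate for classification, assuming the standard definition of EPIG as the mutual information between a candidate's label and a target input's label, averaged over sampled target inputs. The function names (`epig_scores`, `select_top_m`), the array layout and the greedy top-`m` selection are illustrative assumptions, not the authors' implementation, and the sketch does not cover LA-EPIG or the likelihood-reweighting estimator.

```python
import numpy as np


def epig_scores(probs_pool, probs_targ, eps=1e-12):
    """Estimate EPIG for each candidate input (hypothetical helper).

    probs_pool: [K, N, C] class probabilities for N candidates under
                K sampled parameter settings (e.g. dropout masks).
    probs_targ: [K, M, C] class probabilities for M sampled target inputs
                under the same K parameter settings.
    Returns an array of shape [N], one score per candidate.
    """
    num_samples = probs_pool.shape[0]

    # Joint predictive over (y, y*): average over parameter samples of the
    # product of the two conditionally independent predictives.
    joint = np.einsum("knc,kmd->nmcd", probs_pool, probs_targ) / num_samples

    # Product of the marginal predictives, used as the independence baseline.
    marg_pool = probs_pool.mean(axis=0)  # [N, C]
    marg_targ = probs_targ.mean(axis=0)  # [M, C]
    indep = marg_pool[:, None, :, None] * marg_targ[None, :, None, :]

    # Mutual information between y and y* for every (candidate, target) pair.
    mi = np.sum(joint * (np.log(joint + eps) - np.log(indep + eps)), axis=(2, 3))

    # EPIG is the expectation over target inputs.
    return mi.mean(axis=1)


def select_top_m(scores, m):
    """Indices of the m highest-scoring candidates (a simplifying assumption;
    the paper's actual data-store update rule may differ)."""
    return np.argsort(-scores)[:m]
```

Under this sketch, a candidate scores close to zero when all parameter samples agree on its prediction, and scores highly when disagreement about its label is tied to disagreement about the target predictions.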

Paper Structure

This paper contains 23 sections, 14 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Data prioritisation (darker shades indicate higher values) varies substantially between the expected predictive information gain (EPIG; Bickford Smith et al., 2023), a label-aware variant (LA-EPIG) and the "memorable information criterion" (MIC; Sun et al., 2022). LA-EPIG and MIC are evaluated on "true" labels, $y_\mathrm{true} = \arg\max_{y} p_\mathrm{true}(y \mid x)$, as well as "flipped" labels, $y_\mathrm{flip} = 1 - y_\mathrm{true}$.
  • Figure 2: Capitalising on unlabelled data leads to stronger baseline predictive performance, but that does not necessarily translate to better data subsampling. Here a semi-supervised model (comprising an unsupervised encoder and a supervised prediction head) outperforms a fully supervised model when both models are trained on randomly subsampled data. At the same time, the benefit of intelligent subsampling (with EPIG) that we see for the fully supervised model does not hold for the semi-supervised model.
  • Figure 3: The benefit of intelligent subsampling for training a semi-supervised model depends strongly on the construction of the model. Here intelligent subsampling has an adverse effect on predictive performance for a model that uses a dropout MLP for its prediction head, but EPIG-based subsampling has a positive effect for a model that uses a random forest in place of the dropout MLP.
  • Figure 4: Aligning with results on Split MNIST (Figure 3), the performance of intelligent subsampling on Split CIFAR-10 shows a strong dependence on model construction. Here EPIG-based subsampling is beneficial for one model but not the other.
  • Figure 5: Results for the setup presented in Figure 2 except with data-store size $m=250$.
  • ...and 5 more figures