On Sample Selection for Continual Learning: a Video Streaming Case Study

Alexander Dietmüller; Romain Jacob; Laurent Vanbever

TL;DR

The paper tackles continual learning for ML-based adaptive bitrate (ABR) algorithms in evolving networks, focusing on tail performance rather than only on average metrics. It introduces Memento, a density-based sample-space coverage method that selects training samples from low-density regions to maximize coverage and triggers retraining only when new information arises. In a real-world deployment on the Puffer ABR system and in synthetic-shift simulations, Memento achieves a 14% reduction in stalls with modest SSIM impact and shows robust, architecture-agnostic benefits. The work positions density-based sampling as complementary to existing strategies such as JTT and MatchMaker, improves tail reliability in practice, and provides artifacts for reproducibility and for application to other networking ML tasks.

Abstract

Machine learning (ML) is a powerful tool to model the complexity of communication networks. As networks evolve, we cannot simply train once and deploy. Retraining models, known as continual learning, is necessary. Yet, to date, there is no established methodology to answer the key questions: Which samples should we retrain with? When should we retrain? We address these questions with the sample selection system Memento, which maintains a training set with the "most useful" samples to maximize sample space coverage. Memento particularly benefits rare patterns (the notoriously long "tail" in networking) and allows assessing rationally when retraining may help, i.e., when the coverage changes. We deployed Memento on Puffer, the live-TV streaming project, and achieved a 14% reduction of stall time, 3.5x the improvement of random sample selection. Finally, Memento does not depend on a specific model architecture; it is likely to yield benefits in other ML-based networking applications.

Paper Structure (83 sections, 7 equations, 25 figures, 2 tables, 1 algorithm)

Introduction
Main contributions
A case for density
Density for sample selection
Density for shift detection
Coverage maximization
Definitions
Process
Predictions
Replay memory
Sample selection strategy
Distance measurement
Batching
Distribution distances
Inputs and Outputs
...and 68 more sections

Figures (25)

Figure 1: On Puffer, retraining daily with random samples did not consistently outperform a never-retrained model. On average, image quality and stream-time spent stalled differ by less than 0.2% and 4.2%, respectively. Mean and 90% CI over a one-month sliding window. Data source: Puffer
Figure 2: More resources do not guarantee better tail performance. Using more random samples has virtually no effect. Using more models to select training samples (QBC) initially helps but degrades over time. Mean and 90% CI over a two-week sliding window (see the Puffer section for details).
Figure 3: Memento maximizes sample space coverage, improving the tail while rationalizing when to retrain.
Figure 4: Loss improvement obtained by retraining with 1M samples over a dataset of 5M. The same batches of 256 samples are used for the loss-based (left) and density-based (right) selection. To improve tail performance, we need many low-density batches because they are all different. To maintain average performance, we need only a few high-density batches, as they are similar. Selecting based on density (right) achieves both. Conversely, loss-based selection (left) is too specific: it suffers from diminishing returns by selecting too many high-loss batches and catastrophically forgets the average.
Figure 5: Selecting samples based on the Euclidean distance between batch averages does not improve tail performance. Computing the Jensen-Shannon distance between batch distributions is a better choice. Mean and 90% CI over a two-week sliding window (see the Puffer section for details).
...and 20 more figures
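
The contrast drawn in the Figure 5 caption can be illustrated with a small sketch. It is not the paper's implementation: the helper names (`histogram`, `js_distance`), the bin count, and the toy batches are assumptions made for illustration. It uses the standard Jensen-Shannon distance (the square root of the Jensen-Shannon divergence, here with base-2 logarithms so the value lies in [0, 1]) and shows two batches whose averages coincide while their distributions do not overlap.

```python
import math

def histogram(values, bins, lo, hi):
    """Normalized histogram over `bins` equal-width bins in [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: sqrt of the JS divergence (base 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(0.0, divergence))

# Hypothetical toy batches: same average, very different distributions.
batch_a = [0.5] * 8
batch_b = [0.0] * 4 + [1.0] * 4

mean_gap = abs(sum(batch_a) / len(batch_a) - sum(batch_b) / len(batch_b))
p = histogram(batch_a, bins=4, lo=0.0, hi=1.0)
q = histogram(batch_b, bins=4, lo=0.0, hi=1.0)
jsd = js_distance(p, q)

print(mean_gap)  # distance between batch averages
print(jsd)       # distance between batch distributions
```

In this toy case the distance between averages is zero, so an average-based comparison would treat the batches as identical, while the Jensen-Shannon distance reaches its maximum because the two histograms share no bin.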

Theorems & Definitions (3)

Definition 1: Jensen-Shannon Distance
Definition 2: KDE
Definition 3: Coverage increase
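
The definitions themselves are not reproduced on this page, so the following is only a generic sketch of how a kernel density estimate (KDE) over pairwise batch distances could drive density-based selection. The Gaussian kernel, the bandwidth, the greedy "evict the densest batch" rule, and the names `kde_density` and `select_batches` are all assumptions for illustration, not Memento's actual algorithm.

```python
import math

def kde_density(distances_to_others, bandwidth):
    """Gaussian-kernel density estimate from distances to the kept batches."""
    kernel_sum = sum(math.exp(-0.5 * (d / bandwidth) ** 2)
                     for d in distances_to_others)
    return kernel_sum / len(distances_to_others)

def select_batches(distances, capacity, bandwidth=0.1):
    """Greedily evict the batch in the densest region until `capacity` remain.

    `distances[i][j]` is a symmetric distance between batches i and j
    (for example, a distance between batch distributions).
    """
    keep = list(range(len(distances)))
    while len(keep) > capacity:
        densities = {
            i: kde_density([distances[i][j] for j in keep], bandwidth)
            for i in keep
        }
        densest = max(keep, key=lambda i: densities[i])
        keep.remove(densest)
    return keep

# Hypothetical 1-D stand-in: three near-duplicate batches and two rare ones.
positions = [0.0, 0.01, 0.02, 0.5, 1.0]
distances = [[abs(a - b) for b in positions] for a in positions]

kept = select_batches(distances, capacity=3)
print(kept)
```

With these toy inputs, the selection keeps both isolated batches and only one of the three near-duplicates, which mirrors the intuition in the Figure 4 caption: a few high-density batches suffice for the average, while low-density batches are worth keeping because they are all different.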
