Label Delay in Online Continual Learning

Botos Csaba; Wenxuan Zhang; Matthias Müller; Ser-Nam Lim; Mohamed Elhoseiny; Philip Torr; Adel Bibi

TL;DR

This work tackles label delay in online continual learning by formalizing a setting with two data streams: unlabeled current inputs and delayed labels arriving after $d$ steps, under a fixed compute budget $\mathcal{C}$. It shows that naive training on delayed labels degrades non-trivially as $d$ grows, and that standard SSL, S4L, or TTA approaches under the same compute budgets do not reliably outperform this baseline. The authors propose a simple, efficient baseline called Importance Weighted Memory Sampling (IWMS), which replays memory samples most similar to the new unlabeled data by matching predicted labels and embedding similarity, achieving substantial improvements across large-scale datasets (CLOC, CGLM, FMoW, Yearbook) and sometimes closing the gap to non-delayed performance. They also analyze the impact of label delay on compute scaling and provide extensive ablations showing the robustness of IWMS to memory size and sampling choices, offering practical guidance for deploying learning systems under annotation latency constraints.
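The replay rule described above (match on predicted label, then prefer memory samples with similar embeddings) can be illustrated with a minimal sketch. This is one plausible reading of the summary, not the authors' implementation: the function names `cosine` and `iwms_pick`, the clipping of negative similarities to zero, the small epsilon, and the fallback to the whole memory when no label matches are all assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

# Hypothetical memory layout: one (embedding, true_label) pair per stored sample.
Memory = List[Tuple[Sequence[float], int]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def iwms_pick(query_emb: Sequence[float], predicted_label: int,
              memory: Memory, rng: random.Random) -> int:
    """Return the index of one memory sample to replay for one unlabeled query."""
    if not memory:
        raise ValueError("memory must contain at least one labeled sample")
    # Stage 1: keep memory entries whose true label equals the predicted label.
    candidates = [i for i, (_, y) in enumerate(memory) if y == predicted_label]
    if not candidates:
        # Assumption: fall back to the whole memory when no label matches.
        candidates = list(range(len(memory)))
    # Stage 2: sample one candidate with probability proportional to its
    # (non-negative) cosine similarity to the query embedding.
    weights = [max(cosine(query_emb, memory[i][0]), 0.0) + 1e-8 for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Usage: two memory samples of class 0, one of class 1.
memory = [([1.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1)]
rng = random.Random(0)
picks = [iwms_pick([1.0, 0.0], 0, memory, rng) for _ in range(200)]
```

In this toy setup only the two class-0 entries are eligible, and the one aligned with the query embedding is drawn far more often than the orthogonal one. In a real system the embeddings and predicted labels would come from the current model, and one such draw would be made per unlabeled sample in the incoming batch.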

Abstract

Online continual learning, the process of training models on streaming data, has gained increasing attention in recent years. However, a critical aspect often overlooked is label delay, where new data may not be labeled due to slow and costly annotation processes. We introduce a new continual learning framework with explicit modeling of the label delay between data and label streams over time steps. In each step, the framework reveals both unlabeled data from the current time step $t$ and labels delayed by $d$ steps, from the time step $t-d$. In our extensive experiments amounting to 1060 GPU days, we show that merely augmenting the computational resources is insufficient to tackle this challenge. Our findings underline a notable performance decline when relying solely on labeled data once the label delay becomes significant. More surprisingly, state-of-the-art SSL and TTA techniques that utilize the newer, unlabeled data fail to surpass the performance of a naïve method that simply trains on the delayed supervised stream. To this end, we introduce a simple, efficient baseline that rehearses from the labeled memory samples that are most similar to the new unlabeled samples. This method bridges the accuracy gap caused by label delay without significantly increasing computational complexity. We show experimentally that our method is the least affected by the label delay factor and in some cases successfully recovers the accuracy of the non-delayed counterpart. We conduct various ablations and sensitivity experiments, demonstrating the effectiveness of our approach.
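To make the two-stream setting concrete, the following minimal sketch simulates a stream in which the input for step $t$ is revealed immediately while its label only becomes available at step $t+d$. The generator name `delayed_label_stream` and the tuple layout are hypothetical choices for illustration; each `x` may stand for a single sample or a whole batch.

```python
from collections import deque
from typing import Any, Iterable, Iterator, Optional, Tuple


def delayed_label_stream(
    samples: Iterable[Tuple[Any, Any]], d: int
) -> Iterator[Tuple[int, Any, Optional[Tuple[Any, Any]]]]:
    """Yield (t, x_t, labeled) for each step of the stream.

    x_t is the unlabeled input revealed at step t. `labeled` is the pair
    (x_{t-d}, y_{t-d}) whose annotation has just arrived, or None while
    t < d and no labels have come back yet.
    """
    pending = deque()  # inputs still waiting at the annotator
    for t, (x, y) in enumerate(samples):
        pending.append((x, y))
        # Once more than d items are queued, the oldest one is from step t - d.
        labeled = pending.popleft() if len(pending) > d else None
        yield t, x, labeled


# Usage: five steps with a label delay of d = 2.
data = [(f"x{t}", t % 3) for t in range(5)]
steps = list(delayed_label_stream(data, d=2))
```

With `d=2`, the first two steps carry no labeled data, and from step 2 onward each step pairs the newest unlabeled input with the labeled input from two steps earlier. Setting `d=0` recovers the non-delayed setting in which every input arrives together with its label.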

Paper Structure (27 sections, 19 figures, 2 algorithms)

Introduction
Related Work
Problem Formulation
IWMS: Importance Weighted Memory Sampling
The Cost of Ignoring Label Delay
Experimental Setup
Observations
Section Conclusion
Utilising Data Prior to Label Arrival
Experiment Setup
Observations
Analysis of Importance Weighted Memory Sampling
Conclusion and Future Work
Acknowledgement
Supplementary Material
...and 12 more sections

Figures (19)

Figure 1: Illustration of label delay. This figure shows a typical Continual Learning (CL) setup with label delay due to annotation. At every time step $t$, the data stream $\mathcal{S}_\mathcal{X}$ reveals a batch of unlabeled data $\{x^t\}$, on which the model $f_\theta$ is evaluated (highlighted with green borders). The data is then sent to the annotator $\mathcal{S}_\mathcal{Y}$ who takes $d$ time steps to provide the corresponding labels. Consequently, at time step $t$ the batch of labels $\{y^{t-d}\}$ corresponding to the input data from $d$ time steps before becomes available. The CL model can be trained using the delayed labeled data (shown in color) and the newest unlabeled data (shown in grayscale). In this example, the stream reveals three samples at each time step and the annotation delay is $d=2$.
Figure 2: Effects of Varying Label Delay. The performance of a Naïve Online Continual Learner model gradually degrades with increasing values of delay $d$.
Figure 3: Comparison of various unsupervised methods. The plots show the accuracy gap caused by label delay, i.e., the gap between Naïve without delay and its delayed counterpart, Naïve. Our proposed method, IWMS, consistently outperforms all other method categories under all delay settings on three out of four datasets.
Figure 4: Backward transfer. Measuring forgetting on the withheld validation set.
Figure 5: Effect of sampling strategy (left) and memory size (right). We report the Online Accuracy under the least challenging (top: $d=10$) and the most challenging (bottom: $d=100$) label delay scenarios on CGLM (Prabhu et al., 2023).
...and 14 more figures
