Label Delay in Online Continual Learning
Botos Csaba, Wenxuan Zhang, Matthias Müller, Ser-Nam Lim, Mohamed Elhoseiny, Philip Torr, Adel Bibi
TL;DR
This work tackles label delay in online continual learning by formalizing a setting with two data streams: unlabeled current inputs and delayed labels arriving after $d$ steps, under a fixed compute budget $\mathcal{C}$. It shows that naive training on delayed labels degrades non-trivially as $d$ grows, and that standard SSL, S4L, or TTA approaches under the same compute budgets do not reliably outperform this baseline. The authors propose a simple, efficient baseline called Importance Weighted Memory Sampling (IWMS), which replays memory samples most similar to the new unlabeled data by matching predicted labels and embedding similarity, achieving substantial improvements across large-scale datasets (CLOC, CGLM, FMoW, Yearbook) and sometimes closing the gap to non-delayed performance. They also analyze the impact of label delay on compute scaling and provide extensive ablations showing the robustness of IWMS to memory size and sampling choices, offering practical guidance for deploying learning systems under annotation latency constraints.
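The selection step described above (match predicted labels, then prefer similar embeddings) could be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the exponential weighting of cosine similarity, and the uniform fallback when no memory sample carries the predicted label are all assumptions made for this example.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def iwms_select(mem_emb, mem_labels, new_emb, new_pred, rng):
    """Pick one memory index per unlabeled sample (hypothetical helper).

    mem_emb / mem_labels: embeddings and true labels of the labeled memory.
    new_emb / new_pred:   embeddings and model-predicted labels of the
                          newest unlabeled samples.
    """
    chosen = []
    for emb, pred in zip(new_emb, new_pred):
        # Step 1: keep memory samples whose true label equals the prediction.
        candidates = [i for i, y in enumerate(mem_labels) if y == pred]
        if not candidates:
            # Assumed fallback: uniform sampling over the whole memory.
            chosen.append(rng.randrange(len(mem_labels)))
            continue
        # Step 2: sample a candidate with probability increasing in similarity
        # (assumed weighting: exp of cosine similarity).
        weights = [math.exp(cosine(mem_emb[i], emb)) for i in candidates]
        chosen.append(rng.choices(candidates, weights=weights, k=1)[0])
    return chosen
```

The returned indices would then be used to draw the replay batch from memory in place of uniform sampling; how that batch is mixed with the delayed labeled data is left out of this sketch.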
Abstract
Online continual learning, the process of training models on streaming data, has gained increasing attention in recent years. However, a critical aspect often overlooked is the label delay, where new data may not be labeled due to slow and costly annotation processes. We introduce a new continual learning framework with explicit modeling of the label delay between data and label streams over time steps. In each step, the framework reveals both unlabeled data from the current time step $t$ and labels delayed by $d$ steps, from the time step $t-d$. In our extensive experiments amounting to 1060 GPU days, we show that merely augmenting the computational resources is insufficient to tackle this challenge. Our findings underline a notable performance decline when relying solely on labeled data once the label delay becomes significant. More surprisingly, state-of-the-art SSL and TTA techniques that utilize the newer, unlabeled data fail to surpass the performance of a naïve method that simply trains on the delayed supervised stream. To this end, we introduce a simple, efficient baseline that rehearses from the labeled memory samples that are most similar to the new unlabeled samples. This method bridges the accuracy gap caused by label delay without significantly increasing computational complexity. We show experimentally that our method is the least affected by the label delay factor and in some cases successfully recovers the accuracy of the non-delayed counterpart. We conduct various ablations and sensitivity experiments, demonstrating the effectiveness of our approach.
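To make the two-stream setting concrete, the following sketch simulates what a learner would observe at each step: the unlabeled input from step $t$ and, once available, the labeled pair from step $t-d$. The generator name and the tuple layout are hypothetical choices for this illustration, not part of the paper's code.

```python
from collections import deque


def delayed_label_stream(stream, d):
    """Yield (t, x_t, labeled) for each step of a labeled stream.

    `stream` is an iterable of (x, y) pairs in time order. `labeled` is the
    pair (x_{t-d}, y_{t-d}) whose label has just arrived, or None while
    t < d and no label has been revealed yet.
    """
    pending = deque()  # samples whose labels are still "in annotation"
    for t, (x, y) in enumerate(stream):
        pending.append((x, y))
        # A label becomes available exactly d steps after its input.
        labeled = pending.popleft() if len(pending) > d else None
        yield t, x, labeled


if __name__ == "__main__":
    toy_stream = [(f"x{i}", i) for i in range(5)]
    for t, x_t, labeled in delayed_label_stream(toy_stream, d=2):
        print(t, x_t, labeled)
```

With `d=0` the generator reduces to the usual non-delayed setting, where each input arrives together with its label.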
