Learning from One Continuous Video Stream

João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman

TL;DR

This work investigates learning from a single continuous video stream, a setting with high temporal correlation and no minibatches, by introducing a unified pixel-to-pixel framework that supports multiple tasks through RGB-space targets. It evaluates both in-stream adaptation and out-of-stream generalization using two long video streams (Ego4D-stream and ScanNet-stream) and a family of future-prediction pretraining tasks. The key findings show that momentum-free optimizers (e.g., RMSProp), less frequent weight updates, and pretraining on IID data with future-prediction objectives yield substantial gains, with Baby Learning rivaling IID batch-size-1 performance for generalization and surpassing it for adaptation. This approach offers a practical path toward on-device, privacy-friendly continual learning from continuous sensory streams, with implications for embodied AI and personalized digital assistants.

Abstract

We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames, and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of streams and tasks composed from two existing video datasets, plus methodology for performance evaluation that considers both adaptation and generalization. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation as well as between arbitrary tasks, without ever requiring changes to models and always using the same pixel loss. Equipped with this framework, we obtained large single-stream learning gains from pre-training with a novel family of future prediction tasks, and found that momentum hurts and that the pace of weight updates matters. The combination of these insights leads to matching the performance of IID learning with batch size 1, when using the same architecture and without costly replay buffers.
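
To make the two optimization findings concrete (no momentum, and less frequent weight updates), here is a minimal sketch in plain Python. It is not the authors' implementation: a toy two-parameter linear model stands in for the pixel-to-pixel network, and the names `rmsprop_step`, `learn_from_stream`, and `update_every`, along with all hyperparameter values, are assumptions chosen for illustration.

```python
import math


def rmsprop_step(w, grad, sq_avg, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update with no momentum term (illustrative hyperparameters)."""
    new_w, new_sq = [], []
    for wi, gi, si in zip(w, grad, sq_avg):
        si = decay * si + (1.0 - decay) * gi * gi  # running mean of squared gradients
        new_w.append(wi - lr * gi / (math.sqrt(si) + eps))
        new_sq.append(si)
    return new_w, new_sq


def learn_from_stream(stream, update_every=4, lr=0.01):
    """Consume (x, y) samples strictly in order, one at a time (no shuffling).

    Gradients are accumulated and the weights change only every
    `update_every` samples. Each loss is recorded before the weights have
    been updated with that sample, so it reflects in-stream adaptation.
    """
    w = [0.0, 0.0]        # toy model: prediction = w[0] * x + w[1]
    sq_avg = [0.0, 0.0]   # RMSProp state
    acc = [0.0, 0.0]      # accumulated gradient since the last update
    losses = []
    for t, (x, y) in enumerate(stream, start=1):
        err = w[0] * x + w[1] - y
        losses.append(err * err)
        acc[0] += 2.0 * err * x
        acc[1] += 2.0 * err
        if t % update_every == 0:
            grad = [a / update_every for a in acc]
            w, sq_avg = rmsprop_step(w, grad, sq_avg, lr=lr)
            acc = [0.0, 0.0]
    return w, losses


if __name__ == "__main__":
    # A slowly drifting, highly correlated toy stream: y = 2x + 1.
    stream = [(t / 1000.0, 2.0 * (t / 1000.0) + 1.0) for t in range(4000)]
    weights, losses = learn_from_stream(stream)
    print("early loss:", sum(losses[:100]) / 100)
    print("late loss:", sum(losses[-100:]) / 100)
```

Setting `update_every=1` recovers an update after every sample; larger values slow the pace of weight updates without storing past samples, which is the property that distinguishes this from a replay buffer.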

Paper Structure

This paper contains 17 sections, 21 figures, and 4 tables.

Figures (21)

  • Figure 1: Top: We introduce a framework for studying continuous learning in a single video stream. This is a natural yet unstudied problem, different from standard independent and identically distributed (IID) learning in video where batches contain clips from random videos in a random order. Bottom: We propose pixel-to-pixel models to evaluate our approach across prediction tasks (prediction of future frames, depth, segmentation). We measure both adaptation to the video stream -- the model here updates its weights (learns) continuously to improve prediction -- as well as generalization to out-of-stream clips -- the model being adapted on the first stream is now evaluated on a different held-out stream without being allowed to adapt to it. We propose to maximize both adaptation and generalization.
  • Figure 2: UNet training on ScanNet-stream for the semantic segmentation task. Left: The cosine similarity of consecutive gradients is normally distributed when training on IID data, but shows very strong correlations when training on a continuous video stream. Right: This is reflected in poor training performance. See the appendix for similar figures for Ego4D-stream.
  • Figure 3: Video pretraining tasks we consider, sorted from easiest to hardest, left to right -- guided future prediction, vanilla future prediction, and masked future prediction. Each column shows 4 consecutive frames vertically. For each method we show left-to-right: input frames, predictions from the model, target frames. We use a displacement ($\Delta$) of 16 frames (0.64s) between input and target clips.
  • Figure 4: A sweep over commonly used optimizers. Those without momentum are shown in blue and aid the model's adaptability considerably compared to the more commonly used Adam variants, which are shown in red.
  • Figure 5: Reducing momentum with the AdamW optimizer helps to recover some of the performance of RMSProp.
  • ...and 16 more figures
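
The diagnostic described in the Figure 2 caption, the cosine similarity of consecutive gradients, can be sketched on a toy problem. The code below is an assumption-laden illustration, not the paper's setup: it uses a fixed two-parameter linear model with zero weights instead of a UNet, a synthetic ordered sequence instead of ScanNet-stream, and the hypothetical helper names `cosine_similarity`, `toy_gradients`, and `consecutive_gradient_similarity`.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_gradients(xs):
    """Per-sample squared-error gradients for a linear model with zero weights.

    The target is y = 2x + 1, so the error at input x is -(2x + 1) and the
    gradient with respect to (slope, intercept) is (2 * err * x, 2 * err).
    """
    grads = []
    for x in xs:
        err = 0.0 * x + 0.0 - (2.0 * x + 1.0)
        grads.append((2.0 * err * x, 2.0 * err))
    return grads


def consecutive_gradient_similarity(grads):
    """Cosine similarity between each gradient and the one that follows it."""
    return [cosine_similarity(g0, g1) for g0, g1 in zip(grads, grads[1:])]


if __name__ == "__main__":
    ordered = [-1.0 + 2.0 * i / 999 for i in range(1000)]  # stream-like order
    shuffled = ordered[:]
    random.Random(0).shuffle(shuffled)                     # IID-like order
    for name, xs in (("ordered", ordered), ("shuffled", shuffled)):
        sims = consecutive_gradient_similarity(toy_gradients(xs))
        print(name, sum(sims) / len(sims))
```

In this toy setting, neighbouring samples in the ordered sequence are nearly identical, so their gradients point in nearly the same direction, while shuffling breaks that alignment. The same measurement applied to a real network would use flattened parameter gradients in place of the two-element tuples.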