
Learning from Streaming Video with Orthogonal Gradients

Tengda Han, Dilara Gokay, Joseph Heyward, Chuhan Zhang, Daniel Zoran, Viorica Pătrăucean, João Carreira, Dima Damen, Andrew Zisserman

TL;DR

This paper tackles learning from streaming video, where consecutive batches produce highly correlated gradients and violate the IID assumption. It introduces orthogonal gradients: the component of the current gradient $g_t$ that is orthogonal to a smoothed gradient history $c_{t-1}$, computed as $u_t = g_t - \text{proj}_{c_{t-1}}(g_t)$, with the history updated as $c_t = \beta c_{t-1} + (1-\beta) g_t$. Feeding $u_t$ instead of $g_t$ to the base optimizer yields an Orthogonal-AdamW optimizer that keeps the informative part of each update while decorrelating temporally correlated gradients. Across three tasks (DoRA on a single long video, VideoMAE on multi-video datasets, and future frame prediction on streams), the orthogonal optimizer consistently improves over standard AdamW when training on sequential data. The results indicate improved representation learning and test-time adaptation, suggesting practical benefits for streaming-video applications and resource-constrained settings where random access to data is limited. However, the gains are domain-dependent: ImageNet solo classification shows limited or negative benefits, underscoring the need to match the optimization strategy to the characteristics of the data distribution.
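The update rule in the TL;DR can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the gradient is flattened into a single vector, that $\text{proj}_{c}(g)$ is the standard vector projection $\frac{\langle g, c \rangle}{\langle c, c \rangle} c$, and that the history starts at zero. The function name `orthogonal_gradient`, the default `beta=0.9`, and the `eps` guard are hypothetical choices made for the example.

```python
import numpy as np


def orthogonal_gradient(g, c_prev, beta=0.9, eps=1e-12):
    """One step of the orthogonal-gradient rule (illustrative sketch).

    g      : current gradient g_t, flattened to a 1-D array (assumption)
    c_prev : smoothed gradient history c_{t-1}, same shape as g
    beta   : smoothing factor (0.9 is a placeholder, not a value from the paper)
    Returns (u_t, c_t).
    """
    denom = float(np.dot(c_prev, c_prev))
    if denom > eps:
        # Standard vector projection of g onto c_prev (assumed form of proj).
        proj = (np.dot(g, c_prev) / denom) * c_prev
    else:
        # No usable history yet: nothing to project out.
        proj = np.zeros_like(g)
    u = g - proj                          # u_t = g_t - proj_{c_{t-1}}(g_t)
    c = beta * c_prev + (1.0 - beta) * g  # c_t = beta * c_{t-1} + (1 - beta) * g_t
    return u, c


# Toy usage: the part of g that lies along the history direction is removed.
g_t = np.array([1.0, 1.0])
c_tm1 = np.array([1.0, 0.0])
u_t, c_t = orthogonal_gradient(g_t, c_tm1)
```

In a real training loop, `u_t` would be handed to the base optimizer (SGD or AdamW) in place of `g_t`; how that is wired per parameter or per layer is not specified in this summary.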

Abstract

We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer; we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. In all three scenarios (DoRA, VideoMAE, future prediction), our orthogonal optimizer outperforms the strong AdamW baseline.

Paper Structure

This paper contains 37 sections, 5 equations, 4 figures, 10 tables, 2 algorithms.

Figures (4)

  • Figure 1: We address the task of learning from video by sequentially loading its clips in time (top). As neighbouring clips are very similar, consecutive gradients are highly correlated -- we show the histogram of cosine similarity of gradients between consecutive batches. This causes model collapse. In contrast, current methods shuffle the video to simulate an IID input (middle). Consecutive gradients are accordingly decorrelated -- cosine similarity is centred around 0. We propose to learn from the orthogonal gradients -- which allow standard optimizers to recover the drop in performance when training from a sequential video stream (bottom).
  • Figure 2: A simplified illustration of orthogonal gradients. (a) In common IID training, the gradients of consecutive steps are only weakly correlated due to the IID nature of the data. (b) When learning from sequential videos, the gradients of consecutive steps are highly correlated, which harms the optimization. We propose to update the model parameters from the orthogonal component of the current gradient, denoted as $u_t$. In practice, the gradients and the orthogonal operation are in a high dimensional space.
  • Figure 3: Effect of the orthogonal optimizer on sequential training of DoRA on the $\text{WT}_{\text{Venice}}$ video. Under IID training, consecutive gradients have low cosine similarity (right). Sequential training (left) naturally yields high similarity between consecutive gradients, but the orthogonal optimizer decorrelates the gradients over time. Note that this figure plots $\cos{(g_{t-1}, g_t)}$.
  • Figure 4: Two batch strategies for sequential video datasets, for videos $\text{V}_i$ divided into clips $\{\text{C}_1, \ldots, \text{C}_N\}$. (a) Batch along the time axis: a more practical way of batching long video streams, where the samples within a batch are highly correlated. When the batch size is large, however, the temporal correlation between consecutive batches might be low. (b) Batch along videos: samples within a batch are diverse, but the temporal correlation between consecutive batches is high. In practice adjacent clips may overlap in time; for clarity, the figure does not show any overlaps.
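
The two batching strategies in the Figure 4 caption can be made concrete with clip indices. The sketch below is one reading of the caption, not the authors' data loader: each clip is a `(video_index, clip_index)` pair, clips do not overlap, every video has the same number of clips, and in strategy (b) the batch size equals the number of videos. The function names are hypothetical.

```python
def batch_along_time(num_videos, clips_per_video, batch_size):
    """Strategy (a): each batch holds consecutive clips of a single video."""
    batches = []
    for v in range(num_videos):
        for start in range(0, clips_per_video, batch_size):
            stop = min(start + batch_size, clips_per_video)
            batches.append([(v, c) for c in range(start, stop)])
    return batches


def batch_along_videos(num_videos, clips_per_video):
    """Strategy (b): batch t holds clip t of every video.

    Samples within a batch come from different videos, while batch t and
    batch t + 1 contain temporally adjacent clips of the same videos.
    """
    return [
        [(v, c) for v in range(num_videos)]
        for c in range(clips_per_video)
    ]


# Toy usage with 2 videos of 4 clips each.
time_batches = batch_along_time(num_videos=2, clips_per_video=4, batch_size=2)
video_batches = batch_along_videos(num_videos=2, clips_per_video=4)
```

With these toy sizes, `time_batches[0]` is `[(0, 0), (0, 1)]` (two neighbouring clips of video 0), while `video_batches[0]` is `[(0, 0), (1, 0)]` (the first clip of each video).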