Test-Time Training on Video Streams

Renhao Wang, Yu Sun, Arnuv Tandon, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang

TL;DR

The paper extends Test-Time Training (TTT) to streaming video by proposing online TTT with both implicit memory (parameter carryover) and explicit memory (a sliding window of recent frames). It demonstrates that training on a short-term, recent frame window yields substantial gains across semantic, instance, and panoptic segmentation, as well as video colorization, outperforming both fixed-model baselines and offline training on entire videos. Using TTT-MAE as the inner loop, the approach shows strong results on KITTI-STEP, COCO Videos, and Lumière films, highlighting the importance of locality and temporal smoothness. The authors provide empirical ablations and a bias-variance theoretical analysis that identifies a sweet spot for the memory window size and discusses broader implications for continual learning and test-time adaptation in real-world video settings.

Abstract

Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is first trained on the same instance using a self-supervised task such as reconstruction. We extend TTT to the streaming setting, where multiple test instances (video frames, in our case) arrive in temporal order. Our extension is online TTT: the current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before it. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The improvements are more than 2.2x and 1.5x for instance and panoptic segmentation, respectively. Surprisingly, online TTT also outperforms its offline variant that accesses strictly more information, training on all frames from the entire test video regardless of temporal order. This finding challenges the conclusions of prior work that used synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on the bias-variance trade-off.
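The online TTT procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`online_ttt`, `toy_train_step`, `toy_predict`), the choice to sample one frame uniformly from the window per update, and the scalar toy "model" are all assumptions made for illustration. In the paper the model is TTT-MAE and the self-supervised task is masked reconstruction.

```python
from collections import deque
import random


def online_ttt(frames, params, train_step, predict, k=16, iters=1, seed=0):
    """Sketch of online TTT over a stream of frames.

    Implicit memory: `params` is carried over from one frame to the next.
    Explicit memory: a sliding window holding the k most recent frames.
    """
    rng = random.Random(seed)
    window = deque(maxlen=k)  # oldest frame is dropped automatically
    predictions = []
    for frame in frames:  # frames arrive in temporal order
        window.append(frame)
        for _ in range(iters):
            # Assumption: draw one training sample uniformly from the window.
            sample = rng.choice(list(window))
            params = train_step(params, sample)  # self-supervised update
        # Predict on the current frame before the next frame is seen.
        predictions.append(predict(params, frame))
    return predictions, params


def toy_train_step(params, sample, lr=0.5):
    """Hypothetical stand-in: one gradient step on 0.5 * (params - sample)**2."""
    return params + lr * (sample - params)


def toy_predict(params, frame):
    """Hypothetical stand-in: the 'prediction' is just the current parameter."""
    return params


if __name__ == "__main__":
    stream = [0.1 * t for t in range(50)]  # slowly drifting toy signal
    preds, final_params = online_ttt(stream, 0.0, toy_train_step, toy_predict, k=16)
    print(len(preds), final_params)
```

Setting `k=1` recovers training on the current frame only, while a `k` as large as the stream approaches training on all past frames; the sliding window sits between these two extremes.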

Paper Structure (27 sections, 18 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Results for instance and panoptic segmentation on COCO Videos, and semantic segmentation on the KITTI-STEP test set. Online TTT-MAE in the streaming setting achieves the best performance (green) in all tasks. Offline TTT-MAE on all frames (yellow) requires the rather unrealistic setting that makes the entire test video available before making predictions. We think of this as "training on all possible futures". See more about this method in Subsection \ref{forget-ablate}. Online TTT still performs better than offline, by taking advantage of locality.
  • Figure 2: In our streaming setting, the current model $f_t$ makes a prediction on the current frame before it can see the next one. $f_t$ is obtained through online TTT, initializing from the previous model $f_{t-1}$. A sliding window of size $k$ contains the current and previous frames as test-time training data for the self-supervised task. Concretely, $k = 16$ gives a window of only 1.6 seconds in our experiments.
  • Figure 3: Training a masked autoencoder (MAE) to reconstruct each test image at test time. Reconstructed images on the right visualize the progress of gradient descent on this one-sample learning problem. For each test image, TTT-MAE \cite{ttt-mae} first masks out the majority of the patches. The masked image is given as input to the autoencoder, which then reconstructs those masked patches. The reconstruction loss is the pixel-wise mean squared error between the original and reconstructed patches. Loss on the main task -- panoptic segmentation -- also falls as reconstruction gets better. The unmasked patches are not shown on the right since they are not part of the reconstruction loss.
  • Figure 4: Random frames from COCO Videos (left) and their labels for panoptic segmentation (right).
  • Figure 5: Effect of window size $k$ on performance. The x-axis is in log-scale. The plot for KITTI-STEP is on the validation set, where we selected the optimal hyper-parameter $k=16$. For all three tasks, with a rate of 10 frames per second, 16 frames cover only 1.6 seconds. In simple terms, our algorithm actually prefers a very short-term memory. The optimal $k$ on COCO Videos turns out to be different for both semantic and panoptic segmentation, but the results we report in Table \ref{tab-main} still use $k=16$. For all window sizes, the batch size, and therefore computational cost, is fixed for TTT-MAE.
  • ...and 4 more figures
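
The masked reconstruction loss described in the Figure 3 caption can be sketched as follows. This is an illustrative toy in plain Python, not the TTT-MAE code: patches are represented as flat lists of pixel values, the function names are hypothetical, and the default `mask_ratio=0.75` is an assumption (the caption only says that the majority of patches are masked).

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Return a list of booleans; True marks a patch that is masked out.

    Assumption: mask_ratio=0.75 is a placeholder for "the majority".
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(original, reconstructed, mask):
    """Pixel-wise mean squared error over the masked patches only."""
    total, count = 0.0, 0
    for patch, recon, is_masked in zip(original, reconstructed, mask):
        if not is_masked:
            continue  # visible patches are not part of the loss
        for x, y in zip(patch, recon):
            total += (x - y) ** 2
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    original = [[0.0, 0.0], [1.0, 1.0]]
    reconstructed = [[1.0, 1.0], [5.0, 5.0]]
    # Only the first patch is masked, so only it contributes to the loss.
    print(masked_mse(original, reconstructed, [True, False]))
```

Restricting the loss to masked patches matches the caption's note that unmasked patches are not part of the reconstruction loss.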