Taming generative video models for zero-shot optical flow extraction

Seungwoo Kim; Khai Loong Aw; Klemen Kotar; Cristobal Eyzaguirre; Wanhee Lee; Yunong Liu; Jared Watrous; Stefan Stojanov; Juan Carlos Niebles; Jiajun Wu; Daniel L. K. Yamins

TL;DR

The paper tackles the challenge of extracting optical flow from videos without task-specific training by prompting frozen, large-scale generative video models. It identifies three model properties that aid reliable zero-shot flow extraction: distributional prediction of future frames, factorized latents that treat each spatio-temporal patch independently, and random-access decoding that can condition on any subset of future pixels. The authors find these properties together in the LRAS architecture. They introduce KL-tracing, a test-time inference method that injects a localized perturbation into the first frame, rolls the model out one step, and computes flow from the KL divergence between the perturbed and unperturbed predictive distributions. Without flow-specific fine-tuning, LRAS combined with KL-tracing is competitive with state-of-the-art, task-specific models on the real-world TAP-Vid DAVIS benchmark and the synthetic TAP-Vid Kubric, pointing to a scalable zero-shot path to accurate motion estimation. The work suggests a broader shift toward prompting controllable generative video models to extract diverse visual intermediates without labeled data.

Abstract

Extracting optical flow from videos remains a core computer vision problem. Motivated by the recent success of large general-purpose models, we ask whether frozen self-supervised video models trained only to predict future frames can be prompted, without fine-tuning, to output flow. Prior attempts to read out depth or illumination from video generators required fine-tuning; that strategy is ill-suited for flow, where labeled data is scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models for zero-shot flow extraction. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recently introduced Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time inference procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method is competitive with state-of-the-art, task-specific models on the real-world TAP-Vid DAVIS benchmark and the synthetic TAP-Vid Kubric. Our results show that counterfactual prompting of controllable generative video models is an effective alternative to supervised or photometric-loss methods for high-quality flow.
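The KL-tracing steps described in the abstract (perturb the first frame, roll out one step, compare the two predictive distributions) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `toy_predict_logits` is a hypothetical stand-in for a frozen video model that simply translates the scene by a fixed shift, the predictive distribution is a per-pixel categorical over 16 intensity bins, the KL direction is taken as KL(perturbed || clean), and the landing point of the tracer is read out as the KL-weighted centroid.

```python
import numpy as np

NUM_BINS = 16  # assumed number of intensity bins in the toy predictive distribution


def toy_predict_logits(frame, shift=(2, 3)):
    """Hypothetical stand-in for a frozen next-frame predictor.

    Assumes the whole scene translates by `shift` = (dy, dx) and returns
    per-pixel categorical logits over intensity bins for the next frame.
    """
    moved = np.roll(frame, shift, axis=(0, 1))
    centers = (np.arange(NUM_BINS) + 0.5) / NUM_BINS
    # Logits peak at the bin whose center is closest to the predicted intensity.
    return -100.0 * (moved[..., None] - centers) ** 2


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kl_trace(frame, y, x, predict_logits, bump=0.5, radius=1):
    """Estimate the flow at (y, x) by tracing a small perturbation.

    Returns ((dy, dx), kl_map), where kl_map is the per-pixel
    KL(perturbed || clean) between the two one-step predictions.
    """
    clean = log_softmax(predict_logits(frame))

    # Inject a localized tracer: brighten a small square around (y, x).
    perturbed_frame = frame.copy()
    perturbed_frame[y - radius:y + radius + 1, x - radius:x + radius + 1] += bump
    pert = log_softmax(predict_logits(perturbed_frame))

    # Per-pixel KL divergence between perturbed and clean predictive distributions.
    kl_map = (np.exp(pert) * (pert - clean)).sum(axis=-1)

    # Read out where the tracer landed as the KL-weighted centroid.
    ys, xs = np.indices(kl_map.shape)
    total = kl_map.sum()
    peak_y = (kl_map * ys).sum() / total
    peak_x = (kl_map * xs).sum() / total
    return (peak_y - y, peak_x - x), kl_map


rng = np.random.default_rng(0)
frame = 0.4 * rng.random((32, 32))  # intensities kept low so the bump stays in range
flow, kl_map = kl_trace(frame, y=10, x=12, predict_logits=toy_predict_logits)
```

In this toy setting the KL map is nonzero only where the tracer reappears in the predicted frame, so the recovered displacement should match the shift built into the stand-in model. A real generative video model would produce a noisier KL map, which is why a centroid is used here instead of a single argmax; that choice is part of the sketch, not something stated in the source.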
