Taming generative video models for zero-shot optical flow extraction

Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L. K. Yamins

TL;DR

The paper asks whether optical flow can be extracted from frozen, large-scale generative video models without any task-specific training. It identifies three model properties that aid reliable zero-shot flow: distributional prediction of future frames, factorized latents that treat each spatio-temporal patch independently, and random-access decoding. Among the architectures studied, only LRAS has all three. The authors introduce KL-tracing, a test-time inference method that perturbs the first frame at the query point, rolls the model out one step, and computes the KL divergence between the perturbed and unperturbed predictive distributions. Without flow-specific fine-tuning, LRAS with KL-tracing is competitive with state-of-the-art, task-specific flow models on the real-world TAP-Vid DAVIS benchmark and the synthetic TAP-Vid Kubric. The work points toward prompting controllable generative video models to extract visual intermediates without labeled data.

Abstract

Extracting optical flow from videos remains a core computer vision problem. Motivated by the recent success of large general-purpose models, we ask whether frozen self-supervised video models trained only to predict future frames can be prompted, without fine-tuning, to output flow. Prior attempts to read out depth or illumination from video generators required fine-tuning; that strategy is ill-suited for flow, where labeled data is scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models for zero-shot flow extraction. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recently introduced Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time inference procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method is competitive with state-of-the-art, task-specific models on the real-world TAP-Vid DAVIS benchmark and the synthetic TAP-Vid Kubric. Our results show that counterfactual prompting of controllable generative video models is an effective alternative to supervised or photometric-loss methods for high-quality flow.
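
The readout step of KL-tracing described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the predictor is assumed to emit categorical logits of shape (H, W, V) for the second frame, the KL direction (perturbed relative to clean) is a choice made here because the abstract does not specify it, and the names `kl_map` and `flow_from_kl` are hypothetical. The toy logits at the bottom stand in for the outputs of a real model.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_map(clean_logits, perturbed_logits):
    """Per-location KL(perturbed || clean); inputs have shape (H, W, V).

    The direction of the divergence is an assumption of this sketch.
    """
    log_p = log_softmax(perturbed_logits)
    log_q = log_softmax(clean_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

def flow_from_kl(kl, query_yx):
    """Flow (dx, dy) from the query location to the peak of the KL map."""
    peak_y, peak_x = np.unravel_index(np.argmax(kl), kl.shape)
    return int(peak_x - query_yx[1]), int(peak_y - query_yx[0])

# Toy stand-in for model outputs: the perturbation injected at (y=2, x=3)
# in frame 1 changes the predictive distribution only at (y=5, x=6).
rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 8, 16))
perturbed = clean.copy()
perturbed[5, 6, 3] += 5.0

kl = kl_map(clean, perturbed)
flow = flow_from_kl(kl, query_yx=(2, 3))
print(flow)  # (3, 3)
```

Taking the argmax of the KL map is the simplest possible readout; a real pipeline would also need to map token-grid coordinates back to pixel coordinates, which this sketch leaves out.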

Paper Structure

This paper contains 14 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: We introduce a zero-shot test-time inference procedure called KL-tracing, which extracts robust optical flow and point tracking from a generative video model on challenging in-the-wild videos. In every column, the green line links the query location in the first frame (top) to the position predicted by our method in the second frame (bottom). All clips are real-world internet videos and contain phenomena that classical, appearance-based optical flow methods find challenging: (A) Newton's cradle, where both frames have four balls in the middle, but the balls are different; the example involves physical reasoning. (B) Globe has challenging in-place object rotation and the query point is in the textureless ocean. (C) Dog weaving through occluding poles with large, rapid motion, including depth changes and motion blur. (D) Soccer tackle with fast, diagonal motion with motion blur and partial occlusion. (E) Windmill rotation where the repetitive blades and uniform sky make local matching challenging. These examples highlight the benefits of leveraging a powerful video model to extract optical flow for challenging real-world scene dynamics.
  • Figure 2: Test-time inference procedure for extracting flow from a pre-trained, frozen, generative video model, based on the Counterfactual World Model (CWM) paper (Bear et al., 2023). This involves three steps: (1) Perturbation: add a small, white 2D Gaussian dot to frame 1 at the location of the point we wish to track. (2) Prediction: generate model predictions conditioned on the two frames. For CWM, Cosmos, and LRAS, we provide frame 1 and masked patches of frame 2 (see the sections on deterministic CWM, Cosmos, and LRAS). For Stable Video Diffusion, we provide the noised latents of both frames (see the section on Stable Video Diffusion). (3) Flow estimation: compute the RGB difference between the clean and perturbed predictions.
  • Figure 3: KL-tracing, our novel yet simple test-time inference procedure for extracting optical flow from controllable generative models such as LRAS. We follow the same steps for perturbation and conditioned prediction as in Figure 2, but estimate optical flow by computing the KL divergence between the clean and perturbed prediction logits.
  • Figure 4: Our method, KL-tracing with LRAS, extracts better flow than other generative video models. (A) Deterministic models, such as CWM (Bear et al., 2023), often produce blurry predictions because they model a single, average future state. (B) Stable Video Diffusion lacks fine-grained controllability due to its coarse global latent code: its clean and perturbed predictions differ at locations to which the perturbation should not have been carried. (C) The Cosmos autoregressive world model lacks fine-grained controllability because it does not use pointers to denote the position of each token, making it challenging to prompt for flow extraction. (D) The LRAS model is highly controllable and shows minimal differences between the clean and perturbed predictions. We use KL-tracing to compute the difference in logit space instead of RGB space, obtaining sharp flow extractions.
  • Figure 5: The KL divergence of prediction distributions bypasses the noisy RGB differences that result from sampling randomness. Computing the KL divergence between the clean and perturbed prediction logits (last column) is more efficient than, yet functionally similar to, computing the average RGB difference over many samples (second-to-last column).
  • ...and 2 more figures
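
The perturbation and RGB-difference steps from the Figure 2 caption can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the function names `add_gaussian_dot` and `rgb_difference_flow` are hypothetical, blending toward white with a Gaussian weight is one plausible reading of "white 2D Gaussian dot", and `toy_predict` is a deterministic stand-in (a fixed translation) for a real video model, which the caption does not specify at this level of detail.

```python
import numpy as np

def add_gaussian_dot(frame, center_yx, sigma=1.5):
    """Blend a white 2D Gaussian dot into an (H, W, 3) frame in [0, 1]."""
    h, w, _ = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center_yx
    weight = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    weight = weight[..., None]
    # weight == 1 at the center, so the center pixel becomes pure white.
    return frame * (1.0 - weight) + weight

def rgb_difference_flow(clean_pred, perturbed_pred, query_yx):
    """Flow (dx, dy) from the query point to the largest RGB difference."""
    diff = np.abs(perturbed_pred - clean_pred).sum(axis=-1)
    peak_y, peak_x = np.unravel_index(np.argmax(diff), diff.shape)
    return int(peak_x - query_yx[1]), int(peak_y - query_yx[0])

def toy_predict(frame):
    """Hypothetical stand-in for a next-frame predictor: the whole scene
    moves 4 pixels down and 3 pixels left."""
    return np.roll(frame, shift=(4, -3), axis=(0, 1))

rng = np.random.default_rng(0)
frame1 = rng.uniform(0.0, 0.1, size=(32, 32, 3))  # dark toy frame
query = (10, 12)                                  # (y, x) point to track

perturbed_frame1 = add_gaussian_dot(frame1, query)
flow = rgb_difference_flow(toy_predict(frame1),
                           toy_predict(perturbed_frame1),
                           query)
print(flow)  # (-3, 4)
```

With a deterministic stand-in the RGB difference is clean. With a stochastic generative model, each sampled prediction differs, which is the noise that the Figure 5 caption says the KL divergence over prediction logits avoids.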