Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals

Stefan Stojanov, David Wendt, Seungwoo Kim, Rahul Venkatesh, Kevin Feigelis, Jiajun Wu, Daniel LK Yamins

TL;DR

Opt-CWM addresses real-world motion estimation by replacing hand-crafted, domain-specific perturbations with a learnable counterfactual perturbation generator, trained with a bootstrapped scheme that couples flow estimation to next-frame prediction and uses no labeled data. It builds on Counterfactual World Modeling with an RGB-conditioned predictor trained under asymmetric masking, and introduces Gaussian perturbations conditioned on local appearance, which are learned end-to-end through the reconstruction loss of a flow-conditioned predictor. The approach achieves state-of-the-art performance on TAP-Vid First, is robust to large frame gaps, and can be distilled into efficient RAFT-family architectures for faster inference. These results point to a scalable path for self-supervised, counterfactual-based extraction of motion and related visual properties from unrestricted video data.

Abstract

Estimating motion in videos is an essential computer vision problem with many downstream applications, including controllable video generation and robotics. Current solutions are primarily trained using synthetic data or require tuning of situation-specific heuristics, which inherently limits these models' capabilities in real-world contexts. Despite recent developments in large-scale self-supervised learning from videos, leveraging such representations for motion estimation remains relatively underexplored. In this work, we develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model. Opt-CWM works by learning to optimize counterfactual probes that extract motion information from a base video model, avoiding the need for fixed heuristics while training on unrestricted video inputs. We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.

Paper Structure

This paper contains 30 sections, 3 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Extracting flow and occlusion with counterfactual perturbation: (A) CWMs learn to predict the next frame with a temporally factored masking policy [bear2023unifying]. (B) The motion of a point can be estimated using a simple counterfactual probing program FLOW: the model predicts the next frame with and without a local perturbation placed on the point, and the difference image between the clean and perturbed predictions reveals the estimated motion. (C) Occlusion is estimated using a related probe OCC: a diffuse, low-magnitude perturbation difference image indicates that the perturbed point has been occluded.
  • Figure 2: Parameterizing the counterfactual intervention policy as an input-conditioned function. (A) Building on a pre-trained RGB-conditioned predictor $\boldsymbol{\Psi}^\texttt{RGB}$, Opt-CWM uses an image-conditioned perturbation prediction function $\delta_\theta$ containing a small MLP$_\theta$. (B) $\delta_\theta$ can learn to predict image-conditioned perturbations that blend naturally with the underlying scene, potentially allowing the perturbation to be carried over accurately to the next-frame prediction. But how should the parameters of $\delta_\theta$ be learned to achieve this, without any flow supervision labels? See Figure 3.
  • Figure 3: A generic principle for learning optimal counterfactuals. (A) The parameterized counterfactual flow function $\textbf{FLOW}_\theta$ extracts motion from a frozen RGB-conditioned predictor $\boldsymbol{\Psi}^\texttt{RGB}$ through counterfactual perturbation (details in Figure 2). Its parameters $\theta$ are trained using gradients from a flow-conditioned predictor $\boldsymbol{\Psi}^\texttt{flow}_\eta$ that is jointly trained to perform next-frame prediction. The predictor $\boldsymbol{\Psi}^\texttt{flow}$ can only learn to predict future frames if it is given correct flow vectors. This explicit information bottleneck ensures that useful gradients are passed back to $\textbf{FLOW}_\theta$. This setup allows us to get better extractions from a pre-trained $\boldsymbol{\Psi}^\texttt{RGB}$ predictor by training another flow-conditioned predictor $\boldsymbol{\Psi}^\texttt{flow}$ using the same principle of next-frame prediction. (B) As a consequence of the tight coupling between the flow-conditioned predictor $\boldsymbol{\Psi}^{\texttt{flow}}$ and the learned flow estimation function $\textbf{FLOW}_\theta$, motion estimation and pixel reconstruction improve simultaneously.
  • Figure 4: Qualitative comparison with baselines on real-world videos. The above examples show the failure modes of previous methods that rely on visual similarity or photometric loss. We observe that the baseline models struggle against subtle but functionally important changes in largely homogeneous scenes depicting objects of similar color and texture ((a) - (e)). Further, the use of photometric loss in self-supervised methods such as SMURF can also be susceptible to differences in light intensity across frame pairs ((f) - (h)). Opt-CWM, however, relies on a holistic understanding of scene transformations and object dynamics and is able to find correspondence without arbitrary heuristics.
  • Figure 5: Perturbation maps emergently reflect scene properties. For two example frame pairs, we show the amplitudes and standard deviations, at each spatial position and for each color channel, of the optimal Gaussian perturbations predicted by MLP$_\theta$. These "perturbation maps" emergently reflect scene properties, with perturbation parameters varying in size and magnitude depending on where they are located in the image, corresponding to the presence of foreground objects and their parts.
  • ...and 5 more figures
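
The FLOW and OCC probes described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' implementation: the names `gaussian_patch`, `flow_probe`, and `is_occluded`, the thresholds, and the two stand-in predictors are all assumptions made for illustration. In particular, the real predictor is a learned masked next-frame model and the real perturbation parameters are predicted by MLP$_\theta$, whereas here the "predictor" is a fixed function of a single grayscale frame and the Gaussian has fixed amplitude and width.

```python
import numpy as np


def gaussian_patch(shape, center, amplitude=1.0, sigma=1.5):
    """Return an (H, W) Gaussian bump centered at center = (row, col)."""
    rows, cols = np.indices(shape)
    dist_sq = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return amplitude * np.exp(-dist_sq / (2.0 * sigma ** 2))


def flow_probe(predictor, frame, point, amplitude=1.0, sigma=1.5):
    """Estimate the motion of `point` from clean vs. perturbed predictions."""
    clean = predictor(frame)
    perturbed = predictor(frame + gaussian_patch(frame.shape, point, amplitude, sigma))
    diff = np.abs(perturbed - clean)  # difference image
    peak = np.unravel_index(np.argmax(diff), diff.shape)
    flow = (int(peak[0]) - point[0], int(peak[1]) - point[1])
    return flow, diff


def is_occluded(diff, min_peak=0.5, min_concentration=0.5, radius=4):
    """Flag occlusion when the difference image is low magnitude or diffuse.

    The thresholds are hypothetical and chosen only for this toy example.
    """
    r, c = np.unravel_index(np.argmax(diff), diff.shape)
    window = diff[max(r - radius, 0): r + radius + 1,
                  max(c - radius, 0): c + radius + 1]
    concentration = window.sum() / (diff.sum() + 1e-12)
    return bool(diff.max() < min_peak or concentration < min_concentration)


def shift_predictor(frame):
    """Stand-in predictor: the whole scene moves by (2, 3) pixels."""
    return np.roll(frame, shift=(2, 3), axis=(0, 1))


def blind_predictor(frame):
    """Stand-in predictor that smears out any local perturbation."""
    return np.full_like(frame, frame.mean())


rng = np.random.default_rng(0)
frame = rng.random((32, 32))

flow, diff = flow_probe(shift_predictor, frame, point=(10, 10))
_, flat_diff = flow_probe(blind_predictor, frame, point=(10, 10))
print(flow, is_occluded(diff), is_occluded(flat_diff))
```

With the shifting predictor, the perturbation reappears as a sharp peak displaced by (2, 3) pixels, so the probe recovers that shift and reports no occlusion. With the blind predictor, the difference image is uniform and small, which the sketch treats as occlusion.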