
Forecasting Motion in the Wild

Neerja Thakkar, Shiry Ginosar, Jacob Walker, Jitendra Malik, Joao Carreira, Carl Doersch

Abstract

Visual intelligence requires anticipating the future behavior of agents, yet vision systems lack a general representation for motion and behavior. We propose dense point trajectories as visual tokens for behavior, a structured mid-level representation that disentangles motion from appearance and generalizes across diverse non-rigid agents, such as animals in-the-wild. Building on this abstraction, we design a diffusion transformer that models unordered sets of trajectories and explicitly reasons about occlusion, enabling coherent forecasts of complex motion patterns. To evaluate at scale, we curate 300 hours of unconstrained animal video with robust shot detection and camera-motion compensation. Experiments show that forecasting trajectory tokens achieves category-agnostic, data-efficient prediction, outperforms state-of-the-art baselines, and generalizes to rare species and morphologies, providing a foundation for predictive visual intelligence in the wild.
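The abstract's central idea, treating each point trajectory as one token in an unordered set consumed by a transformer, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: all dimensions, the random projection, and the sinusoidal position encoding below stand in for the learned components.

```python
import numpy as np

# Hypothetical sketch of a per-trajectory token: each tracked point becomes
# one token combining an appearance feature at its start location, its motion
# history, and a (noisy) future track, each with a visibility flag.
# All sizes are illustrative, not the paper's configuration.
N_TRACKS = 64   # unordered set of point trajectories
T_HIST = 8      # past frames of motion history
T_FUT = 16      # future frames to forecast
D_FEAT = 32     # appearance (e.g. DINO-like) feature dimension
D_MODEL = 128   # transformer width

rng = np.random.default_rng(0)
appearance = rng.normal(size=(N_TRACKS, D_FEAT))     # feature at start point
history = rng.normal(size=(N_TRACKS, T_HIST, 3))     # (x, y, visible)
noisy_fut = rng.normal(size=(N_TRACKS, T_FUT, 3))    # noisy (x, y, visible)

def build_tokens(appearance, history, noisy_fut, proj):
    """Flatten each track's inputs into one vector and project to model width."""
    flat = np.concatenate(
        [appearance,
         history.reshape(len(history), -1),
         noisy_fut.reshape(len(noisy_fut), -1)], axis=1)
    return flat @ proj  # shape (N_TRACKS, D_MODEL)

d_in = D_FEAT + 3 * T_HIST + 3 * T_FUT
proj = rng.normal(size=(d_in, D_MODEL)) / np.sqrt(d_in)  # stand-in projection
tokens = build_tokens(appearance, history, noisy_fut, proj)

# Add a positional encoding of each track's initial (x, y) so the set stays
# unordered but location-aware; random-feature sinusoids are one common choice.
start_xy = history[:, 0, :2]
pos = np.concatenate(
    [np.sin(start_xy @ rng.normal(size=(2, D_MODEL // 2))),
     np.cos(start_xy @ rng.normal(size=(2, D_MODEL // 2)))], axis=1)
tokens = tokens + pos
print(tokens.shape)  # (64, 128)
```

The resulting token stack is permutation-invariant input for a transformer denoiser, which in the paper's setup predicts the clean future tracks.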


Figures (9)

  • Figure 1: Dense point trajectories act as visual tokens for behavior, enabling scalable prediction of complex motion across diverse species. Our method takes as input a single RGB image, a history of motion, and an optional high-level motion vector, and forecasts future animal motion in the form of point trajectories. For each predicted point trajectory, we translate a small circular patch of the input image along the motion trajectory and superimpose it on the input image (no pixels are generated!). Leftmost shows the start locations on the input frame; the rest is forecast by our model. Our method is capable of forecasting many different animal species and behaviors, even long-tail ones---the polar bear on the top right is only present in $0.31\%$ of the training data, the caribou on the bottom left in $0.025\%$, and the alpaca on the bottom right in $0.50\%$. See more results at https://motion-forecasting.github.io/.
  • Figure 2: Architecture. Given an input frame and (noisy) tracks, we construct a single token for every track, which includes a DINO feature at the start location, the motion history, and the noisy track values, both with occlusion indicators. After projection, we add a position encoding for the initial point location. Tokens are stacked and fed to a transformer (DiT) to predict clean tracks (right).
  • Figure 3: Our processed data before and after camera stabilization. Given a first frame (left), the middle image shows the point tracks in pixel space, where the motion of the animals and the camera (panning, zooming out) are entangled. On the right are our point tracks in camera-stabilized space. We release all of our annotations, including the camera-stabilized point tracks.
  • Figure 4: Animal motion follows a log-normal distribution: We plot a histogram of animal displacement. The horizontal axis is binned log displacement; the vertical axis is log frequency. We find that a log-normal distribution (purple) fits much better than a power law (orange).
  • Figure 5: Samples from our model: Sampling from our model with different random seeds (one per row) and no displacement conditioning. The frame on the left is the input state after the motion history.
  • ...and 4 more figures
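Figure 4's claim, that displacement magnitudes follow a log-normal rather than a power-law distribution, can be checked with a simple maximum-likelihood comparison. The sketch below uses synthetic log-normal samples in place of the paper's measured animal displacements; the estimators are standard, but the setup is our illustration, not the paper's analysis.

```python
import numpy as np

# Compare a log-normal fit against a power-law fit on displacement
# magnitudes. The samples here are synthetic stand-ins for the paper's
# displacement data; every constant is illustrative.
rng = np.random.default_rng(1)
disp = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Log-normal MLE: mean and std of the log-displacements.
logs = np.log(disp)
mu, sigma = logs.mean(), logs.std()
ll_lognormal = np.sum(
    -logs - 0.5 * np.log(2 * np.pi * sigma**2)
    - (logs - mu) ** 2 / (2 * sigma**2))

# Continuous power-law MLE (Hill-style estimator) over the same support:
# pdf(x) = (alpha - 1) / x_min * (x / x_min) ** (-alpha) for x >= x_min.
x_min = disp.min()
alpha = 1.0 + len(disp) / np.sum(np.log(disp / x_min))
ll_powerlaw = np.sum(
    np.log((alpha - 1) / x_min) - alpha * np.log(disp / x_min))

# The better-fitting family has the higher log-likelihood.
print(ll_lognormal > ll_powerlaw)  # True for log-normally distributed data
```

On real displacement data one would additionally correct for the differing parameter counts (e.g. with AIC) and bin in log space as the figure does, but the likelihood gap in such comparisons is typically large enough that the conclusion is unchanged.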