MegaFlow: Zero-Shot Large Displacement Optical Flow

Dingxi Zhang, Fangjinhua Wang, Marc Pollefeys, Haofei Xu

Abstract

Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely limits their performance in large-displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements that further improve sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: https://kristen-z.github.io/projects/megaflow.
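
To make the global-matching formulation concrete, the sketch below shows one plausible reading of "flow as global matching": a dense correlation between ViT-style feature grids followed by a soft-argmax readout, in the spirit of prior global-matching work. This is an illustrative sketch, not the paper's implementation; the function name, the softmax readout, and the 1/sqrt(C) scaling are our own assumptions.

```python
import torch
import torch.nn.functional as F

def global_matching_flow(feat0, feat1):
    """Hypothetical sketch: dense flow via global feature matching.

    feat0, feat1: [B, C, H, W] feature maps of two frames (e.g. ViT patch
    tokens reshaped to a grid). Every pixel in frame 0 is compared against
    ALL pixels in frame 1, so arbitrarily large displacements are
    representable -- no local search window is involved.
    """
    B, C, H, W = feat0.shape
    f0 = feat0.flatten(2).transpose(1, 2)                  # [B, H*W, C]
    f1 = feat1.flatten(2).transpose(1, 2)                  # [B, H*W, C]

    # Global correlation: similarity of every pixel pair across the frames.
    corr = torch.matmul(f0, f1.transpose(1, 2)) / C**0.5   # [B, H*W, H*W]

    # Soft-argmax: matching distribution over all frame-1 positions,
    # then the expected 2D coordinate of the match.
    prob = F.softmax(corr, dim=-1)                         # [B, H*W, H*W]
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=feat0.dtype),
        torch.arange(W, dtype=feat0.dtype),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=-1).reshape(1, H * W, 2).to(feat0.device)
    matched = torch.matmul(prob, grid)                     # [B, H*W, 2]

    # Flow = matched coordinate minus the pixel's own coordinate.
    flow = (matched - grid).reshape(B, H, W, 2)
    return flow.permute(0, 3, 1, 2)                        # [B, 2, H, W]
```

Because every pixel scores against all positions in the other frame, this formulation has no displacement cap; its memory cost grows as O((HW)^2), which is why matching is typically done on downsampled patch tokens and followed by lightweight refinement for sub-pixel accuracy.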

Figures (8)

  • Figure 1: MegaFlow excels at large displacement optical flow and point tracking. (a) On the Sintel (Final) benchmark, MegaFlow consistently achieves the lowest End-Point Error (EPE), with its advantage widening significantly on large displacements. (b) MegaFlow also demonstrates superior zero-shot point tracking results on TAP-Vid. (c) Visuals and inset error maps further illustrate our state-of-the-art results.
  • Figure 2: The pipeline of MegaFlow. Given an input sequence, a frozen DINO encoder and a trainable CNN extract dense patch tokens and local structural features, respectively. Alternating frame and global attention, followed by feature fusion, process these tokens into a globally consistent representation. Pair-wise global matching then computes initial flows. Finally, a recurrent module iteratively refines the initial flows using spatial convolutions and temporal attention for sub-pixel accuracy. Crucially, our design seamlessly processes variable-length inputs without architectural modifications (see the skeleton sketch after this list).
  • Figure 3: Qualitative comparison of long-range point tracking. Visualization of SEA-RAFT [wang2024sea], MemFlow [dong2024memflow], AllTracker [harley2025alltracker], and our method on the DAVIS benchmark. The first column shows the input frames (spanning 90 frames). The top row visualizes long-range dense point tracking, while the bottom row shows the corresponding optical flow between the first and last frame. Our method produces more accurate and temporally consistent tracks and flow estimates over very long sequences.
  • Figure 4: Qualitative comparison of optical flow. Visualization of SEA-RAFT [wang2024sea], MemFlow [dong2024memflow], WAFT-DAv2-a2 [wang2025waft], and our method on the Spring benchmark. The colorbar indicates endpoint error. Our approach outperforms prior methods, generalizing well to Full HD resolution while preserving both local and global motion details.
  • Figure 5: Impact of multi-frame context on temporal consistency. Top row: Consecutive input frames. Middle row: Optical flow estimated from isolated frame pairs. Bottom row: Flow estimated jointly ($T=4$). Processing isolated pairs leads to temporal inconsistencies and occlusion artifacts, particularly around the moving subject (red boxes) and background structures (blue boxes). In contrast, our expanded multi-frame context produces highly stable and accurate motion boundaries.
  • ...and 3 more figures
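
The Figure 2 caption outlines the full pipeline; the skeleton below mirrors that data flow stage by stage. It is a hypothetical sketch, not MegaFlow's code: the small strided convolution stands in for the frozen DINO ViT, the attention blocks and refinement head are placeholders, and it reuses the `global_matching_flow` sketch from above. Inputs are clips of at least two frames.

```python
import torch
import torch.nn as nn

class PipelineSketch(nn.Module):
    """Hypothetical skeleton of the Figure 2 pipeline. Every module here is
    a stand-in for the real component; only the data flow between stages
    follows the caption."""

    def __init__(self, dim=64, heads=4, refine_iters=4):
        super().__init__()
        self.refine_iters = refine_iters
        # Stand-in for the frozen DINO encoder producing dense patch tokens.
        self.backbone = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        for p in self.backbone.parameters():
            p.requires_grad = False                  # frozen vision prior
        # Trainable CNN branch for local structural features.
        self.cnn = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        # Alternating per-frame ("frame") and cross-frame ("global") attention.
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fuse ViT-style tokens with CNN features.
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)
        # Lightweight recurrent refinement head (predicts a flow update).
        self.refine = nn.Conv2d(dim + 2, 2, kernel_size=3, padding=1)

    def forward(self, frames):                       # frames: [B, T, 3, H, W]
        B, T, _, H, W = frames.shape
        x = frames.flatten(0, 1)                     # [B*T, 3, H, W]
        tok = self.backbone(x)                       # dense patch tokens
        h, w = tok.shape[-2:]
        seq = tok.flatten(2).transpose(1, 2)         # [B*T, h*w, C]
        # Frame attention: tokens attend within each frame...
        seq = seq + self.frame_attn(seq, seq, seq)[0]
        # ...then global attention across the whole clip (any T works).
        glob = seq.reshape(B, T * h * w, -1)
        glob = glob + self.global_attn(glob, glob, glob)[0]
        seq = glob.reshape(B * T, h * w, -1).transpose(1, 2).reshape(B * T, -1, h, w)
        # Feature fusion with the local CNN branch.
        feat = self.fuse(torch.cat([seq, self.cnn(x)], dim=1))
        feat = feat.reshape(B, T, -1, h, w)
        # Pair-wise global matching gives the initial flow (see the earlier
        # global_matching_flow sketch); refinements add sub-pixel detail.
        flow = global_matching_flow(feat[:, 0], feat[:, 1])
        for _ in range(self.refine_iters):
            flow = flow + self.refine(torch.cat([feat[:, 0], flow], dim=1))
        return flow                                  # [B, 2, h, w]
```

For example, `PipelineSketch()(torch.randn(1, 4, 3, 64, 64))` yields a [1, 2, 8, 8] flow field at 1/8 input resolution. Because the global attention simply flattens tokens across all T frames, any clip length is accepted without architectural changes, consistent with the variable-length claim in the Figure 2 caption.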