Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning

Zhongxiao Cong, Qitao Zhao, Minsik Jeon, Shubham Tulsiani

TL;DR

Flow3r is a framework that augments visual geometry learning with dense 2D correspondences ('flow') as supervision, enabling scalable training from unlabeled monocular videos. It achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes.

Abstract

Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision, which is expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ('flow') as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: flow between two images is predicted using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ${\sim}800$K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
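
The factorization can be written compactly. The notation below is introduced here purely for illustration and is not taken from the abstract: $g_i$ stands for the geometry latents of image $i$, $c_j$ for the pose latent of image $j$, $\Phi$ for an unspecified fusion step, $D$ for a dense decoder, and $F^{\text{pseudo}}_{i \to j}$ for the 2D correspondence used as supervision. The norm is left unspecified because the exact loss is not stated here.

$$
\hat{F}_{i \to j} = D\big(\Phi(g_i, c_j)\big), \qquad \mathcal{L}_{\text{flow}} = \big\lVert \hat{F}_{i \to j} - F^{\text{pseudo}}_{i \to j} \big\rVert
$$

Under this reading, the flow loss depends on both $g_i$ and $c_j$, so a single 2D signal can send gradients to the geometry of one view and the camera of the other. Nothing in the expression assumes the scene is rigid.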

Paper Structure (22 sections, 17 equations, 8 figures, 15 tables)


Figures (8)

  • Figure 1: Flow3r leverages unlabeled videos (using flow supervision) alongside labeled 3D data for scalable visual geometry learning. This enables accurate multi-view 3D reconstruction in-the-wild, in particular for settings with scarce labeled data, e.g., interaction videos and dynamic scenes.
  • Figure 2: Mechanisms for flow prediction. (a) The visual geometry backbone first extracts the camera token and patch tokens of each input. (b) Existing correspondence heads [wang2025vggt] predict flow directly from local features via matching. (c) Flow can also be obtained by explicitly projecting predicted 3D points into another view via decoded camera parameters. However, this projection-based formulation is restricted to static scenes and sensitive to geometric errors. (d) Our factored flow approach conditions source-view geometry latents on the target-view camera latent and decodes dense correspondences directly in latent space, providing a geometry-aware and robust flow prediction mechanism that naturally extends to dynamic scenes.
  • Figure 3: Overview of Flow3r. Flow3r predicts visual geometry using factored flow supervision, enabling scalable geometry learning from unlabeled videos. Each input image is encoded and processed by the multi-view transformer to produce camera tokens and patch tokens. For data with dense geometry and pose labels, we directly supervise the patch tokens and camera tokens with the corresponding labels. For unlabeled datasets without geometry and pose supervision, we predict flow between two frames in a factorized manner, supervised by the pseudo labels from an off-the-shelf 2D flow prediction model [zhang2025ufm]. To obtain the factored flow, we fuse the patch features of one frame with the camera features of another, and decode the fused representation through the DPT head to produce dense flow predictions.
  • Figure 4: Factored flow prediction aids visual geometry learning. Compared with the baseline (3d-sup) and alternative formulations that use flow supervision (flow-projective, flow-tracking), Flow3r (flow-factored) yields more accurate scene reconstruction, highlighting the benefits of unlabeled data and our factored flow formulation.
  • Figure 5: Scaling with unlabeled videos. With factored flow supervision, increasing the amount of unlabeled SpatialVID data progressively improves dynamic-scene reconstruction quality, even surpassing the model with additional 3D-labeled training data.
  • ...and 3 more figures
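
The projection-based alternative described in the Figure 2(c) caption can be made concrete with a small sketch. This is not the paper's code: it assumes a pinhole camera with shared intrinsics (`fx`, `fy`, `cx`, `cy`), a known per-pixel depth in the source view, and a rigid source-to-target transform (`rotation`, `translation`). All function and parameter names are hypothetical and chosen for illustration.

```python
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def unproject(pixel: Vec2, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift a source pixel with known depth to a 3D point in the source camera frame."""
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def rigid_transform(rotation: Mat3, translation: Vec3, point: Vec3) -> Vec3:
    """Apply X' = R X + t, mapping source-frame coordinates to the target frame."""
    x, y, z = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def project(point: Vec3, fx: float, fy: float, cx: float, cy: float) -> Vec2:
    """Pinhole projection of a camera-frame 3D point onto the image plane."""
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point is behind the camera and cannot be projected")
    return (fx * x / z + cx, fy * y / z + cy)


def projective_flow(pixel: Vec2, depth: float, rotation: Mat3, translation: Vec3,
                    fx: float, fy: float, cx: float, cy: float) -> Vec2:
    """Flow = (reprojected pixel in the target view) - (original source pixel).

    Assumes the 3D point itself does not move between frames (static scene).
    """
    point_src = unproject(pixel, depth, fx, fy, cx, cy)
    point_tgt = rigid_transform(rotation, translation, point_src)
    u_tgt, v_tgt = project(point_tgt, fx, fy, cx, cy)
    return (u_tgt - pixel[0], v_tgt - pixel[1])


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# A point 2 units ahead of the principal point, shifted 0.5 units along x
# in the target frame, moves 25 pixels horizontally at a focal length of 100.
flow = projective_flow((50.0, 50.0), 2.0, IDENTITY, (0.5, 0.0, 0.0),
                       fx=100.0, fy=100.0, cx=50.0, cy=50.0)
print(flow)  # (25.0, 0.0)
```

The sketch shows both limitations named in the caption. The flow depends directly on `depth`, so any error in predicted geometry turns into a flow error. The single rigid transform cannot account for a point that moves on its own between frames.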