FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent

Cameron Smith, David Charatan, Ayush Tewari, Vincent Sitzmann

TL;DR

FlowMap introduces a differentiable, end-to-end framework that recovers per-frame depth, camera intrinsics, and poses from video by optimizing a camera-induced flow objective supervised by optical flow and point tracks. Depth is produced by a neural network while poses and intrinsics are obtained through differentiable, analytical solvers, enabling gradient-based refinement without treating all quantities as free variables. Across multiple real-world datasets, FlowMap delivers depth and camera parameter estimates that support high-quality 360° view synthesis with Gaussian Splatting, rivaling COLMAP and outperforming previous gradient-descent baselines. This work paves the way for self-supervised, differentiable multi-view reconstruction and depth learning directly from internet-scale video data.

Abstract

This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360-degree novel view synthesis (even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM).
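
The objective described above can be illustrated with a minimal NumPy sketch: unproject pixels using depth and intrinsics, move them with a relative pose, reproject, and compare the resulting flow against the observed flow. The function names (`induced_flow`, `flow_loss`), the shared intrinsics matrix, the 4x4 pose convention, and the mean Euclidean norm are assumptions made for illustration. This is not the authors' implementation, which is differentiable and driven by a depth network.

```python
import numpy as np

def induced_flow(depth, K, P_ij, uv):
    """Flow induced at pixels `uv` of frame i by depth, intrinsics, and pose.

    depth: (N,) depths at the sampled pixels (assumed layout)
    K:     (3, 3) pinhole intrinsics, assumed shared by both frames
    P_ij:  (4, 4) rigid transform from frame i's camera to frame j's camera
    uv:    (N, 2) pixel coordinates in frame i
    """
    # Homogeneous pixel coordinates, shape (N, 3).
    uv1 = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)
    # Unproject: x_i = depth * K^{-1} [u, v, 1]^T.
    x_i = depth[:, None] * (np.linalg.inv(K) @ uv1.T).T
    # Transform the points into frame j's camera coordinates.
    x_j = x_i @ P_ij[:3, :3].T + P_ij[:3, 3]
    # Reproject and apply the perspective divide.
    proj = x_j @ K.T
    uv_hat = proj[:, :2] / proj[:, 2:3]
    return uv_hat - uv

def flow_loss(depth, K, P_ij, uv, flow_obs):
    """Mean Euclidean distance between induced and observed flow."""
    residual = induced_flow(depth, K, P_ij, uv) - flow_obs
    return np.mean(np.linalg.norm(residual, axis=1))
```

As a sanity check, an identity pose induces zero flow, and a pure sideways translation shifts nearby points more than distant ones, which is what ties the flow residual to depth.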

Paper Structure (50 sections, 3 equations, 16 figures, 4 tables)


Figures (16)

  • Figure 1: We present FlowMap, an end-to-end differentiable method that recovers poses, intrinsics, and depth maps of an input video. FlowMap is supervised only with off-the-shelf optical flow and point track correspondences, and optimized per-scene with gradient descent. Gaussian Splats obtained from FlowMap's reconstructions regularly match or exceed those obtained from COLMAP in quality.
  • Figure 2: A FlowMap Forward Pass. Given RGB frames (red), optical flow (blue) and point tracks (green), FlowMap computes dense depth $\mathbf{D}$, camera poses $\mathbf{P}$, and intrinsics $\mathbf{K}$ in each forward pass. We obtain depth via a CNN (Sec. \ref{sec:reparams}) and implement differentiable, feed-forward solvers for intrinsics and poses (Sec. \ref{sec:reparams}, Fig. 4). Colored dots indicate which block receives which inputs. FlowMap's only free parameters are the weights of a depth NN and a small correspondence confidence MLP. These parameters are optimized for each video separately by minimizing a camera-induced flow loss (Fig. 3) via gradient descent, though fully feed-forward operation is possible.
  • Figure 3: Camera-Induced Flow Loss. To use a known correspondence $\mathbf{u}_{ij}$ to compute a loss $\mathcal{L}$, we unproject $\mathbf{u}_i$ using the corresponding depth map $\mathbf{D}_i$ and camera intrinsics $\mathbf{K}_i$, transform the resulting point $\mathbf{x}_i$ via the relative pose $\mathbf{P}_{ij}$, reproject the transformed point to yield $\hat{\mathbf{u}}_{ij}$, and finally compute $\mathcal{L} = \|\hat{\mathbf{u}}_{ij} - \mathbf{u}_{ij}\|$.
  • Figure 4: We solve for the relative poses between consecutive frames using their depth maps, camera intrinsics, and optical flow. To do so, we first unproject their depth maps, then solve for the pose that best aligns the resulting point clouds.
  • Figure 5: View Synthesis Results. FlowMap's camera parameters and geometry produce near-photorealistic 3D Gaussian Splatting results on par with COLMAP's.
  • ...and 11 more figures
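
The pose solve summarized in the Figure 4 caption, aligning two unprojected point clouds, can be sketched with the standard orthogonal Procrustes (Kabsch) solution. The function name `procrustes_pose`, the optional per-point `weights` argument, and the assumption that row k of each array is a matched pair (for example via optical flow) are illustrative choices. The paper's actual solver may differ in its details.

```python
import numpy as np

def procrustes_pose(x_i, x_j, weights=None):
    """Rigid (R, t) minimizing sum_k w_k * ||R @ x_i[k] + t - x_j[k]||^2.

    x_i, x_j: (N, 3) corresponding 3D points from frames i and j
    weights:  optional (N,) non-negative confidences (hypothetical argument)
    """
    w = np.ones(len(x_i)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Weighted centroids of both point clouds.
    mu_i = w @ x_i
    mu_j = w @ x_j
    # Weighted 3x3 cross-covariance between the centered clouds.
    H = (x_i - mu_i).T @ (w[:, None] * (x_j - mu_j))
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_j - R @ mu_i
    return R, t
```

Because the solution is closed-form and built from an SVD, it can be expressed in an autodiff framework so that gradients flow back to the depth maps that produced the point clouds.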