AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos

Felix Wimbauer, Weirong Chen, Dominik Muhle, Christian Rupprecht, Daniel Cremers

TL;DR

This work tackles recovering camera poses and intrinsics from casual, dynamic videos without ground-truth 3D labels. It introduces AnyCam, a transformer-based model that predicts relative poses $\mathbf{P}^{i\rightarrow i+1}$ and a focal-length intrinsic $f$ using depth $\mathbf{D}^i$ and flow $\mathbf{F}^{i\rightarrow j}$ as priors, and handles intrinsics via multiple focal-length hypotheses with likelihood-based selection. Training employs an uncertainty-aware flow loss, forward-backward pose consistency, and sequence-wide motion priors, enabling supervision from unlabelled data; a lightweight test-time bundle adjustment refines trajectories to reduce drift. AnyCam achieves state-of-the-art or competitive zero-shot performance on dynamic benchmarks, runs faster than prior SfM/SLAM approaches in dynamic settings, and enables high-quality 4D reconstructions by combining camera information, depth, and uncertainty.

Abstract

Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they are still 1) not robust towards dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera poses. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabelled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. Furthermore, even with trajectory refinement, AnyCam is significantly faster than existing works for SfM in dynamic settings. Finally, by combining camera information, uncertainty, and depth, our model can produce high-quality 4D pointclouds.
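
The abstract's idea of supervising with pre-trained depth and flow networks instead of pose labels can be illustrated with a small sketch. Given a depth value, a focal length, and a relative pose, a static scene point induces a specific optical flow, which can then be compared with the flow network's output. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single-focal-length pinhole model with principal point `(cx, cy)`, and the Laplacian-style weighting `residual / sigma + log(sigma)` are all hypothetical choices made for this example. That weighting is a common uncertainty-aware form and may differ from the loss used in the paper.

```python
import math

def induced_flow(u, v, depth, f, cx, cy, R, t):
    """Flow a static 3D point seen at pixel (u, v) would show under camera motion (R, t).

    Assumes a pinhole camera with one focal length f and principal point (cx, cy).
    R is a 3x3 rotation (nested lists), t a 3-vector; both map frame i to frame j.
    """
    # Unproject the pixel to a 3D point in the frame-i camera.
    X = [(u - cx) / f * depth, (v - cy) / f * depth, depth]
    # Move the point into the frame-j camera: X' = R X + t.
    Xj = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    # Project back to pixel coordinates in frame j.
    uj = f * Xj[0] / Xj[2] + cx
    vj = f * Xj[1] / Xj[2] + cy
    return (uj - u, vj - v)

def uncertainty_weighted_error(residual, sigma):
    """Hypothetical Laplacian-style weighting: a large sigma down-weights the residual
    but pays a log(sigma) penalty, so sigma cannot grow without cost."""
    return residual / sigma + math.log(sigma)

# Example: pure sideways translation of 0.1 units, point at depth 2, f = 500 px.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
flow = induced_flow(320.0, 240.0, 2.0, 500.0, 320.0, 240.0, identity, [0.1, 0.0, 0.0])

# Residual between the induced flow and a (made-up) observed flow for this pixel.
observed = (40.0, 0.0)
residual = math.hypot(flow[0] - observed[0], flow[1] - observed[1])
```

In this toy setup the pixel's observed flow disagrees with the induced flow, as it would on a moving object. Under the assumed weighting, a larger `sigma` lowers the error for such pixels, which matches the intuition that uncertainty maps absorb dynamic content.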

Paper Structure

This paper contains 41 sections, 16 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: AnyCam. Given a casual video and pretrained monocular depth estimation (MDE) and optical flow networks, AnyCam outputs camera poses, camera intrinsics, and uncertainty maps in a single forward pass. The uncertainty maps represent probable movement in the scene. By using a novel loss formulation, AnyCam can be trained on a large corpus of unlabelled videos mostly obtained from YouTube.
  • Figure 2: Architecture. AnyCam processes a sequence of frames from a casual video with corresponding depth maps and optical flow. A backbone extracts feature maps per image. Information sharing between frames is enabled by multiple attention layers that process the features of all sequence images. The transformer architecture outputs one pose token $\phi^{i \rightarrow j}$ per timestep and an additional sequence token $\phi^{\text{seq}}$. The pose tokens are processed using multiple intrinsic hypotheses $f\in \{f_1, \ldots , f_m\}$, parametrized by frame prediction heads $(\mathcal{H}^{\mathbf{P}}_f, \mathcal{H}^{\boldsymbol{\sigma}}_f)$. The sequence head $\mathcal{H}^{\text{seq}}$ predicts the likelihood scores of the different hypotheses. The model is trained end-to-end via a reprojection loss, a pose consistency loss between forward and backward pose predictions, and a KL-divergence loss.
  • Figure 3: Qualitative results on various datasets. Red: forward-pass prediction, Green: refined trajectory, Yellow: GT (if available). AnyCam predicts high-quality pose estimates on challenging scenes in dynamic environments. The uncertainty maps highlight objects with a high likelihood of movement, such as people or cars, that would produce inconsistencies in the induced optical flow. Pose refinement with bundle adjustment further aligns the trajectory, reducing the error relative to the ground-truth poses.
  • Figure 4: Focal length candidates. Predicted likelihood and computed flow loss for different focal length (FL) candidate hypotheses for two sequences of the DAVIS dataset. Below: estimated trajectories for selected FL hypotheses, increasing from left to right. The red trajectory corresponds to the FL with the highest likelihood. The predicted likelihood tends to be more stable than the loss when estimating the best candidate.
  • Figure 5: Focal Length Candidates. Linear-exponential distribution of focal length candidates relative to the image height.
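
The hypothesis mechanism described in Figures 2, 4, and 5 can be sketched as follows: each focal length candidate has its own prediction, a sequence-level head scores the candidates, and the candidate with the highest likelihood is selected. This is a minimal sketch under assumptions, not the authors' code. The candidate values, the raw scores, the softmax normalization, and the function names are hypothetical; the exact "linear-exponential" spacing of the candidates is not specified in the captions, so the list below is made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_focal_hypothesis(candidates, scores):
    """Pick the focal length candidate with the highest predicted likelihood."""
    probs = softmax(scores)
    best = max(range(len(candidates)), key=lambda k: probs[k])
    return candidates[best], probs

# Hypothetical candidates, expressed relative to the image height as in Figure 5.
candidates = [0.5, 1.0, 2.0]
# Hypothetical raw scores, standing in for the output of a sequence-level head.
scores = [0.1, 2.0, -1.0]

f_rel, probs = select_focal_hypothesis(candidates, scores)

# Convert the relative focal length to pixels for a 480-pixel-high image.
image_height = 480
focal_px = f_rel * image_height
```

Selecting by predicted likelihood rather than by the per-candidate flow loss follows the observation in the Figure 4 caption that the likelihood tends to be the more stable signal.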