AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
Felix Wimbauer, Weirong Chen, Dominik Muhle, Christian Rupprecht, Daniel Cremers
TL;DR
This work tackles recovering camera poses and intrinsics from casual, dynamic videos without ground-truth 3D labels. It introduces AnyCam, a transformer-based model that predicts relative poses $\mathbf{P}^{i\rightarrow i+1}$ and a focal-length intrinsic $f$ using depth $\mathbf{D}^i$ and flow $\mathbf{F}^{i\rightarrow j}$ as priors, and handles intrinsics via multiple focal-length hypotheses with likelihood-based selection. Training employs an uncertainty-aware flow loss, forward-backward pose consistency, and sequence-wide motion priors, enabling supervision from unlabelled data; a lightweight test-time bundle adjustment refines trajectories to reduce drift. AnyCam achieves state-of-the-art or competitive zero-shot performance on dynamic benchmarks, runs faster than prior SfM/SLAM approaches in dynamic settings, and enables high-quality 4D reconstructions by combining camera information, depth, and uncertainty.
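The focal-length hypothesis idea can be illustrated with a small sketch. It assumes a pinhole camera with a single focal length $f$ and a known principal point, computes the flow that a depth value and a relative pose would induce, and keeps the candidate whose induced flow best matches the observed flow. The function names (`induced_flow`, `select_focal`), the toy data, and the use of mean endpoint error as a stand-in for likelihood-based selection are all assumptions made for illustration, not the authors' implementation.

```python
import math

def induced_flow(u, v, depth, f, cx, cy, R, t):
    """Flow at pixel (u, v) induced by camera motion (R, t) under a pinhole model."""
    # Unproject the pixel to a 3D point in the first camera frame.
    X = ((u - cx) * depth / f, (v - cy) * depth / f, depth)
    # Move the point into the second camera frame: X' = R X + t.
    Xp = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    # Project back to the image and subtract the original pixel position.
    return (f * Xp[0] / Xp[2] + cx - u, f * Xp[1] / Xp[2] + cy - v)

def select_focal(pixels, depths, observed, candidates, cx, cy):
    """Pick the focal length whose induced flow is closest to the observed flow.

    `candidates` maps each hypothetical focal length to the (R, t) pose
    predicted under that hypothesis.
    """
    def mean_error(f):
        R, t = candidates[f]
        errs = []
        for (u, v), d, (fu, fv) in zip(pixels, depths, observed):
            iu, iv = induced_flow(u, v, d, f, cx, cy, R, t)
            errs.append(math.hypot(iu - fu, iv - fv))
        return sum(errs) / len(errs)
    return min(candidates, key=mean_error)

# Toy data: a static scene seen by a camera translating along x, true f = 500.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.1, 0.0, 0.0]
pixels = [(100.0, 80.0), (320.0, 240.0), (500.0, 400.0)]
depths = [2.0, 4.0, 8.0]
observed = [induced_flow(u, v, d, 500.0, 320.0, 240.0, I3, t)
            for (u, v), d in zip(pixels, depths)]
candidates = {f: (I3, t) for f in (300.0, 500.0, 800.0)}
best_f = select_focal(pixels, depths, observed, candidates, 320.0, 240.0)
```

In this toy setup every hypothesis shares the same pose, so only the focal length changes the induced flow. In a learned model each hypothesis would come with its own pose prediction, and pixels on moving objects would be down-weighted by the predicted uncertainty.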
Abstract
Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment-based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they 1) are still not robust to dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in a feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera poses. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabelled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. Furthermore, even with trajectory refinement, AnyCam is significantly faster than existing SfM methods for dynamic settings. Finally, by combining camera information, uncertainty, and depth, our model can produce high-quality 4D point clouds.
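To make the role of an uncertainty-based loss more concrete, the sketch below uses a common Laplacian negative log-likelihood form, $|r| / \sigma + \log \sigma$, where $r$ is a per-pixel flow residual and $\sigma$ a predicted uncertainty. This form is an assumption chosen for illustration and is not necessarily the exact loss used by AnyCam. The function name `uncertainty_flow_loss` and the numbers are hypothetical.

```python
import math

def uncertainty_flow_loss(residuals, sigmas):
    """Mean of |r| / sigma + log(sigma) over all pixels (Laplacian NLL up to a constant)."""
    terms = [abs(r) / s + math.log(s) for r, s in zip(residuals, sigmas)]
    return sum(terms) / len(terms)

# Two static pixels with small residuals and one pixel on a moving object
# whose observed flow disagrees strongly with the camera-induced flow.
residuals = [0.5, 0.5, 20.0]

# Same uncertainty everywhere: the outlier dominates the loss.
loss_confident = uncertainty_flow_loss(residuals, [1.0, 1.0, 1.0])

# High uncertainty on the outlier: its residual is down-weighted,
# at the cost of the log(sigma) penalty.
loss_hedged = uncertainty_flow_loss(residuals, [1.0, 1.0, 20.0])
```

Under this form, raising $\sigma$ on pixels that violate the static-scene assumption lowers the loss, while the $\log \sigma$ term stops the model from declaring every pixel uncertain. This is one way a model can learn to ignore dynamic objects without motion labels.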
