Can Generative Video Models Help Pose Estimation?
Ruojin Cai, Jason Y. Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, Ricardo Martin-Brualla
TL;DR
This paper introduces InterPose, which uses the priors of pre-trained generative video models to interpolate frames between two views with little overlap, then applies a self-consistency score to select reliable interpolations for pose estimation. The pipeline treats the pose estimator as a black box $f_\text{pose}$ and the video interpolator as $f_\text{vid}$. It generates multiple videos, samples frames from them, and computes the relative pose $T_\text{rel} = T_B T_A^{-1}$. For video selection, a medoid-based distance $D_\text{med}$ plus a bias term $D_\text{bias}$ yields $D_\text{total}$, with the consensus pose denoted $\tilde{T}_\text{med}$. Across four datasets (outdoor, indoor, and object-centric) and three video generators (DynamiCrafter, Runway, and Luma Dream Machine), InterPose consistently improves over running the state-of-the-art DUSt3R on the input pair alone, and the Oracle upper bound reveals substantial room for better video selection. The work demonstrates that large-scale video priors can augment 3D pose reasoning in data-scarce regimes, and it points to future work on faster, more reliable video selection strategies and on prompt design.
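The selection step summarized above can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes each generated video has already produced several relative-pose estimates as 4x4 matrices, for example by running $f_\text{pose}$ on different frame subsets. The helper names (`relative_pose`, `rotation_distance`, `medoid_score`, `select_video`, `rot_z`) are hypothetical, the pose distance is simplified to the rotation geodesic angle, and $D_\text{bias}$ is passed in as precomputed values because the summary does not define it.

```python
import numpy as np

def rot_z(angle):
    """Hypothetical helper: 4x4 pose that rotates by `angle` radians about z."""
    c, s = np.cos(angle), np.sin(angle)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def relative_pose(T_A, T_B):
    """Relative pose T_rel = T_B @ T_A^{-1} for two 4x4 camera matrices."""
    return T_B @ np.linalg.inv(T_A)

def rotation_distance(T1, T2):
    """Geodesic angle in radians between the rotation parts of two poses.
    Assumption: translation is ignored to keep the sketch short."""
    R = T1[:3, :3].T @ T2[:3, :3]
    cos_angle = (np.trace(R) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def medoid_score(poses):
    """Return (medoid index, D_med) for the pose samples of one video.
    The medoid is the sample with the smallest summed distance to the others;
    D_med is taken here as its mean distance to all samples (an assumption)."""
    n = len(poses)
    dist = np.array([[rotation_distance(poses[i], poses[j]) for j in range(n)]
                     for i in range(n)])
    idx = int(np.argmin(dist.sum(axis=1)))
    return idx, float(dist[idx].mean())

def select_video(pose_sets, bias=None):
    """Pick the video with the lowest D_total = D_med + D_bias.
    `bias` holds one precomputed D_bias value per video; zeros if omitted."""
    if bias is None:
        bias = np.zeros(len(pose_sets))
    scores = [medoid_score(poses) for poses in pose_sets]
    totals = np.array([d_med for _, d_med in scores]) + np.asarray(bias, dtype=float)
    best = int(np.argmin(totals))
    return best, pose_sets[best][scores[best][0]], totals

# Toy example: video 0 gives consistent estimates, video 1 does not.
consistent = [rot_z(a) for a in (0.50, 0.52, 0.51)]
inconsistent = [rot_z(a) for a in (0.1, 0.9, 1.7)]
best, pose_med, totals = select_video([consistent, inconsistent])
```

In this toy run the consistent video receives the lower total score, so its medoid pose is returned as the consensus estimate. A fuller version would also compare translation directions and supply a meaningful bias term.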
Abstract
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle in these scenarios due to the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose, that leverages the rich priors encoded within pre-trained generative video models. We propose to use a video model to hallucinate intermediate frames between two input images, effectively creating a dense, visual transition, which significantly simplifies the problem of pose estimation. Since current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates the consistency of pose predictions from sampled videos. We demonstrate that our approach generalizes across three state-of-the-art video models and show consistent improvements over the state-of-the-art DUSt3R on four diverse datasets encompassing indoor, outdoor, and object-centric scenes. Our findings suggest a promising avenue for improving pose estimation models by leveraging large generative models trained on vast amounts of video data, which is more readily available than 3D data. See our project page for results: https://inter-pose.github.io/.
