Can Generative Video Models Help Pose Estimation?

Ruojin Cai, Jason Y. Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, Ricardo Martin-Brualla

TL;DR

Can Generative Video Models Help Pose Estimation? introduces InterPose, which leverages pre-trained generative video priors to interpolate frames between two views with little overlap and uses a self-consistency score to select reliable interpolations for pose estimation. The pipeline treats the pose estimator as a black box $f_\text{pose}$ and the interpolator as $f_\text{vid}$, generating multiple videos and frames to compute the relative pose $T_\text{rel} = T_B T_A^{-1}$; a medoid-based distance $D_\text{med}$ plus a bias term $D_\text{bias}$ yields $D_\text{total}$ for video selection, with the consensus pose $\tilde{T}_\text{med}$. Across four datasets (outdoor, indoor, and object-centric) and three video generators (e.g., DynamiCrafter, Runway, Luma Dream Machine), InterPose consistently improves over the state-of-the-art DUSt3R when only the input pair is available, and the Oracle upper bound reveals substantial room for better video selection. The work demonstrates the viability of large-scale video priors to augment 3D pose reasoning in data-scarce regimes and points to future directions in faster, more reliable video selection strategies and prompt design.
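
The relative pose $T_\text{rel} = T_B T_A^{-1}$ mentioned above can be illustrated with a minimal NumPy sketch. It assumes poses are 4x4 rigid transforms in a world-to-camera convention, which the summary does not state; the helper names (`rot_z`, `make_pose`, `invert_rigid`, `relative_pose`, `rotation_angle_deg`) are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def rot_z(deg: float) -> np.ndarray:
    """Rotation about the z-axis by `deg` degrees (helper for the example)."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_pose(R: np.ndarray, t) -> np.ndarray:
    """Pack a rotation R (3x3) and translation t (3,) into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(t, dtype=float)
    return T

def invert_rigid(T: np.ndarray) -> np.ndarray:
    """Closed-form inverse of a rigid transform: [R^T, -R^T t]."""
    R, t = T[:3, :3], T[:3, 3]
    return make_pose(R.T, -R.T @ t)

def relative_pose(T_A: np.ndarray, T_B: np.ndarray) -> np.ndarray:
    """T_rel = T_B T_A^{-1}; assumed convention: world-to-camera extrinsics."""
    return T_B @ invert_rigid(T_A)

def rotation_angle_deg(R1: np.ndarray, R2: np.ndarray) -> float:
    """Geodesic angle (degrees) between two rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Usage: two cameras that differ by a 30-degree rotation about z.
T_A = make_pose(rot_z(10.0), [1.0, 0.0, 0.0])
T_B = make_pose(rot_z(40.0), [0.0, 2.0, 0.0])
T_rel = relative_pose(T_A, T_B)
angle = rotation_angle_deg(np.eye(3), T_rel[:3, :3])
```

Under the assumed convention, `T_rel` maps camera-A coordinates to camera-B coordinates, so `T_rel @ T_A` reproduces `T_B`. The geodesic angle is one common way to compare rotations; the exact distance used inside $D_\text{med}$ is not specified in this summary.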

Abstract

Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle in these scenarios due to the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose, that leverages the rich priors encoded within pre-trained generative video models. We propose to use a video model to hallucinate intermediate frames between two input images, effectively creating a dense, visual transition, which significantly simplifies the problem of pose estimation. Since current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates the consistency of pose predictions from sampled videos. We demonstrate that our approach generalizes among three state-of-the-art video models and show consistent improvements over the state-of-the-art DUSt3R on four diverse datasets encompassing indoor, outdoor, and object-centric scenes. Our findings suggest a promising avenue for improving pose estimation models by leveraging large generative models trained on vast amounts of video data, which is more readily available than 3D data. See our project page for results: https://inter-pose.github.io/.

Paper Structure

This paper contains 30 sections, 11 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Improving pose estimation by interpolating frames using a video model. Given two images of a scene with almost no overlap, we aim to recover their relative camera pose. Without being able to rely on visual correspondences, existing methods struggle in this setting (left). We propose to use an off-the-shelf video generation model to interpolate a video connecting the two images. Augmented with the frames generated by the video model, existing pose estimators (e.g., DUSt3R (Wang et al., 2024)) are able to more accurately recover the correct pose (right).
  • Figure 2: Common failure modes of video models. We show some failure modes of interpolating between two images. In the first row, a microwave suddenly appears over the sink. In the second and third rows, the video model morphs and blends images without consistent changes to the underlying scene geometry. In the fourth row, the object's appearance changes in an unrealistic way.
  • Figure 3: Qualitative comparison of the three video models: DynamiCrafter (DC), Runway (RW), and Dream Machine (DM), using the same text prompt for each video model. Top left: a pair of images from the Cambridge Landmarks dataset. Prompt: Dozens of bicycles are parked along the street in front of old brick and stone buildings, with a person walking by and trees in the background. Bottom left: a pair of images from ScanNet. Prompt: A cozy café corner features wooden chairs, framed sports photos, and a TV screen. Top right: a pair from DL3DV-10K. Prompt: A peaceful morning stroll along a wooden boardwalk surrounded by lush, sunlit greenery. Bottom right: a pair from NAVI. Prompt: A wooden toy figure with gray ears and green wheels sits next to a small yellow school bus on a black pedestal in an outdoor paved area.
  • Figure 4: Self-consistency scores for poses derived from generated videos. (a) From a pair of input frames $A$ and $B$, we generate several candidate videos from a given video interpolation method. For each video, we sample subsets of frames and compute a relative pose from $A$ to $B$ from each subset ((b) and (c)). We then compute a medoid distance between these samples as a self-consistency score for that video, shown to the left of each video in part (a). In this case, Video 0 contains artifacts and so yields an inconsistent set of poses (and a high medoid distance), while Video 1 is much more natural and produces a more consistent set of poses and a lower medoid distance.
  • Figure 5: Qualitative results of pose estimation from DUSt3R, taking only the image pair as input versus also taking additional video frames. We show the input image pair in the first two columns, and the DUSt3R prediction using the image pair alone in the third column. The 3D reconstruction shows the predicted point maps and camera poses for the input images, with the first camera denoted in blue, the second camera in gold, and its corresponding ground truth camera in red (best seen digitally). In columns four to six, we visualize interpolated frames from three different video models. In the last column, we show the DUSt3R pose predictions made using all 5 images, but display only the poses and point maps corresponding to the input images for clarity.
  • ...and 3 more figures
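
The self-consistency scoring described in the Figure 4 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it compares rotations only (using the geodesic angle), scores a video by the mean distance from the medoid sample to the other samples, and omits the bias term $D_\text{bias}$, whose definition is not given here. The callable `estimate_rel_rotation` is a hypothetical stand-in for running a black-box pose estimator on $A$, $B$, and a subset of generated frames.

```python
import numpy as np

def rot_z(deg: float) -> np.ndarray:
    """Rotation about the z-axis by `deg` degrees (helper for the example)."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def geodesic_deg(R1: np.ndarray, R2: np.ndarray) -> float:
    """Geodesic angle (degrees) between two rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def sample_relative_rotations(num_frames, estimate_rel_rotation,
                              num_subsets=5, subset_size=3, seed=0):
    """Draw random subsets of generated-frame indices and estimate one
    A-to-B relative rotation per subset (estimator is a hypothetical callable)."""
    rng = np.random.default_rng(seed)
    k = min(subset_size, num_frames)
    samples = []
    for _ in range(num_subsets):
        subset = np.sort(rng.choice(num_frames, size=k, replace=False))
        samples.append(estimate_rel_rotation(subset))
    return samples

def medoid_score(rotations):
    """Return (medoid index, mean distance from the medoid to the other samples).
    The medoid is the sample with the smallest summed distance to all others."""
    n = len(rotations)
    D = np.array([[geodesic_deg(a, b) for b in rotations] for a in rotations])
    sums = D.sum(axis=1)
    idx = int(np.argmin(sums))
    return idx, float(sums[idx] / max(n - 1, 1))

def select_video(samples_per_video):
    """Pick the video whose pose samples are most self-consistent (lowest score)."""
    scores = [medoid_score(s)[1] for s in samples_per_video]
    return int(np.argmin(scores)), scores

# Usage with synthetic samples: Video 0 is inconsistent, Video 1 is consistent.
video0 = [rot_z(a) for a in (0.0, 60.0, 120.0, 30.0, 90.0)]
video1 = [rot_z(a) for a in (29.0, 30.0, 31.0, 30.0, 30.0)]
best, scores = select_video([video0, video1])
```

In this toy setup the scattered samples of Video 0 give a much larger score than the tightly clustered samples of Video 1, so Video 1 is selected, mirroring the situation in Figure 4. The medoid sample of the selected video would then serve as the consensus estimate.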