ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation

Hao Tang; Weiyao Wang; Pierre Gleize; Matt Feiszli

TL;DR

ADen tackles sparse-view camera pose estimation by learning a conditional distribution over poses using a generator that outputs multiple hypotheses and a discriminator that selects the best one. This generator–discriminator setup avoids both single-mode regression and dense brute-force sampling, achieving high accuracy with only hundreds of samples and real-time inference. The method achieves state-of-the-art results on CO3D with strong zero-shot generalization to Objectron and NMFR, and experiments confirm superior rotation and translation accuracy along with fast inference (~20 FPS for nine images). This approach offers a practical, scalable solution for sparse-view relocalization in real-world 3D vision tasks.

Abstract

Recovering camera poses from a set of images is a foundational task in 3D computer vision, which powers key applications such as 3D scene and object reconstruction. Classic methods often depend on feature correspondences, such as keypoints, which require the input images to have large overlap and small viewpoint changes. Such requirements present considerable challenges in scenarios with sparse views. Recent data-driven approaches aim to directly output camera poses, either by regressing the 6DoF camera poses or by formulating rotation as a probability distribution. However, each approach has its limitations. On one hand, directly regressing the camera poses can be ill-posed, since it assumes a single mode, which does not hold under symmetry and leads to sub-optimal solutions. On the other hand, probabilistic approaches are capable of modeling the symmetry ambiguity, yet they sample the entire rotation space uniformly by brute force. This leads to an inevitable trade-off between sample density, which improves model precision, and sample efficiency, which determines the runtime. In this paper, we propose ADen to unify the two frameworks by employing a generator and a discriminator: the generator is trained to output multiple hypotheses of the 6DoF camera pose to represent a distribution and handle multi-mode ambiguity, and the discriminator is trained to identify the hypothesis that best explains the data. This allows ADen to combine the best of both worlds, achieving substantially higher precision as well as lower runtime than previous methods in empirical evaluations.
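The generate-then-score idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `softmax` and `select_pose` are hypothetical, the use of a softmax to turn discriminator scores into probabilities is an assumption, and a single yaw angle in degrees stands in for a full 6DoF pose.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_pose(hypotheses, scores):
    """Pick the hypothesis with the highest probability.

    hypotheses: candidate poses from a (hypothetical) generator.
    scores: one raw score per candidate from a (hypothetical) discriminator.
    Returns (best hypothesis, its probability, all probabilities).
    """
    if not hypotheses or len(hypotheses) != len(scores):
        raise ValueError("need one score per hypothesis and at least one hypothesis")
    probs = softmax(scores)  # assumption: scores are normalized with a softmax
    best = max(range(len(probs)), key=probs.__getitem__)
    return hypotheses[best], probs[best], probs


# Toy example: a symmetric object yields two clusters of candidates
# (near 10 degrees and near 190 degrees). A scalar yaw stands in for a 6DoF pose.
hypotheses = [10.0, 12.0, 190.0, 191.0]
scores = [0.5, 2.0, 1.9, 0.1]
best_pose, best_prob, probs = select_pose(hypotheses, scores)
```

In this toy setting both modes receive non-trivial probability, which is the behavior a single-mode regressor cannot express, while only four candidates are scored rather than a dense grid.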

Paper Structure (17 sections, 6 equations, 6 figures, 6 tables)

This paper contains 17 sections, 6 equations, 6 figures, 6 tables.

Introduction
Related work
Structure-from-Motion (SfM)
Data-driven pose estimation
Generative learning and contrastive learning
Method
Multi-view feature extraction
Pose generator
Pose discriminator
Model training details
Implementation details
Experiments
Experiment setup
Comparing with SoTA
Ablation
...and 2 more sections

Figures (6)

Figure 1: Ambiguity in wide baseline images. Implicit-PDF and RelPose model rotation as a probability distribution using an energy-based method, which requires evaluating densely sampled rotation hypotheses. To achieve high accuracy, RelPose requires assessing 500k rotations for each image pair, incurring significant computational costs. In contrast, ADen outputs 500 high-accuracy hypotheses directly, avoiding the constraints imposed by grid resolution. Filled circles are samples, while unfilled circles are the ground-truth relative rotation.
Figure 2: ADen overview. ADen is a novel method for recovering camera poses from sparse-view RGB images. ADen starts by extracting per-image features using a ResNet backbone, then uses a transformer to fuse features from all images and propagate information globally. ADen predicts a non-uniform distribution over camera poses for each image by first applying a pose generator head to the fused features to produce a support set of $M$ camera poses, then using a pose discriminator with the fused features to predict a probability for each generated pose.
Figure 3: Camera pose prediction of ADen on CO3D examples.
Figure 4: Relative rotation prediction. We visualize the relative rotations predicted by ADen on ambiguous cases. The size of each filled circle represents the probability assigned by the discriminator. The larger unfilled circle is the ground truth.
Figure 5: Performance of the generator and discriminator as a function of the number of sampled poses.
...and 1 more figures
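Figures 1 and 4 plot relative rotations between image pairs against a ground truth. As a minimal sketch of how such a comparison might be computed, the code below forms a relative rotation and measures its geodesic angle to a prediction. The convention `R_rel = R_j * R_i^T` and all helper names are assumptions for illustration; the source does not state which convention or error metric the figures use.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    """Transpose a 3x3 matrix (the inverse, for a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_z(deg):
    """Rotation by `deg` degrees about the z axis."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def relative_rotation(r_i, r_j):
    """Relative rotation from camera i to camera j (assumed convention)."""
    return matmul(r_j, transpose(r_i))


def geodesic_deg(r_a, r_b):
    """Angle in degrees of the rotation that takes r_b to r_a."""
    r = matmul(r_a, transpose(r_b))
    trace = r[0][0] + r[1][1] + r[2][2]
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


# Two cameras rotated 20 and 65 degrees about z: relative rotation is 45 degrees.
gt_rel = relative_rotation(rot_z(20.0), rot_z(65.0))
pred_rel = rot_z(50.0)  # hypothetical prediction, off by 5 degrees
error_deg = geodesic_deg(pred_rel, gt_rel)
```

With several hypotheses per image pair, the same distance could be evaluated for each candidate to see how close the nearest one lies to the ground truth.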
