6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model

Matteo Bortolon, Theodore Tsesmelis, Stuart James, Fabio Poiesi, Alessio Del Bue

TL;DR

This paper tackles 6DoF camera pose estimation from a single image given a 3D Gaussian Splatting (3DGS) model. It introduces 6DGS, which inverts the 3DGS rendering process by densely casting rays from the ellipsoids via a radiant Ellicell, then uses an attention mechanism to bind rays to image pixels and a weighted least-squares solution to recover the camera pose without initialization. The approach achieves state-of-the-art accuracy and near real-time performance on real datasets (Tanks & Temples and Mip-NeRF 360°), outperforming NeRF-based baselines even though those baselines are given an initial pose prior. Ablation studies validate the use of a moderate number of top-scoring rays and of the Ellicell-based ray generation, while noting limitations such as scene-specific retraining. Overall, 6DGS offers robust, fast 6DoF pose estimation suitable for real-world NVS-based robotics and scene understanding tasks.

Abstract

We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. 6DGS avoids the iterative process typical of analysis-by-synthesis methods (e.g., iNeRF), which also require an initialization of the camera pose in order to converge. Instead, our method estimates a 6DoF pose by inverting the 3DGS rendering process. Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each of the ellipsoids that parameterize the 3DGS model. Each Ellicell ray is associated with the rendering parameters of its ellipsoid, which in turn are used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best-scoring bundle of rays, whose intersection provides the camera center and, in turn, the camera rotation. The proposed solution obviates the need for an "a priori" pose for initialization, and it solves 6DoF pose estimation in closed form, without the need for iterations. Moreover, compared to the existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS can improve the overall average rotational accuracy by 12% and translation accuracy by 22% on real scenes, despite not requiring any initialization pose. At the same time, our method operates in near real-time, reaching 15 fps on consumer hardware.
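
The abstract states that the intersection of the best-scoring rays provides the camera center, and the TL;DR describes this as a weighted least-squares solution. The snippet below is a minimal sketch of one standard way to pose that problem, not the authors' implementation: it finds the point that minimizes the weighted sum of squared perpendicular distances to a set of rays. The function name `intersect_rays_wls`, the argument layout, and the use of per-ray scalar weights are assumptions made for illustration.

```python
import numpy as np

def intersect_rays_wls(origins, directions, weights):
    """Weighted least-squares point closest to a bundle of 3D lines.

    Minimizes sum_i w_i * ||(I - d_i d_i^T) (p - o_i)||^2 over p, where
    o_i is a ray origin and d_i its unit direction (illustrative formulation).
    """
    origins = np.asarray(origins, dtype=float)        # (N, 3)
    directions = np.asarray(directions, dtype=float)  # (N, 3)
    weights = np.asarray(weights, dtype=float)        # (N,)

    # Normalize directions so that I - d d^T is an orthogonal projector.
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)

    # One projector per ray onto the plane perpendicular to that ray.
    proj = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]  # (N, 3, 3)

    # Normal equations: (sum_i w_i P_i) p = sum_i w_i P_i o_i
    A = np.einsum("n,nij->ij", weights, proj)
    b = np.einsum("n,nij,nj->i", weights, proj, origins)
    return np.linalg.solve(A, b)

# Usage: rays that all pass through a known point should recover that point.
rng = np.random.default_rng(0)
true_center = np.array([1.0, -2.0, 0.5])
ray_origins = rng.normal(size=(20, 3))
ray_dirs = true_center - ray_origins
ray_weights = rng.uniform(0.1, 1.0, size=20)
estimate = intersect_rays_wls(ray_origins, ray_dirs, ray_weights)
```

The system is solvable in closed form as long as the rays are not all parallel, which is consistent with the non-iterative character the abstract emphasizes. How the weights are derived from the pixel-ray bindings is not specified in this section, so they are treated here as given inputs.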

Paper Structure (19 sections, 17 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Our 6DGS method introduces a novel approach to 6DoF pose estimation, departing from conventional analysis-by-synthesis methodologies. Standard NeRF-based methods (left) employ an iterative process, rendering candidate poses and comparing them with the target image before updating the pose, which often results in slow performance and limited precision. In contrast, 6DGS (right) estimates the camera pose by selecting a bundle of rays projected from the ellipsoid surface (a radiant Ellicell) and learning an attention map to output ray/image pixel correspondences (based on DINOv2). The optimal bundle of rays should intersect at the optical center of the camera; these rays are then used to estimate the camera rotation in closed form. Our 6DGS method offers significantly improved accuracy and speed, enabling the recovery of the pose in a one-shot estimate.
  • Figure 1: Additional scenes from the Tanks & Temples dataset. For each scene, we show a visualization of the camera poses with respect to the model (top) for 6DGS as well as the baselines, which are visualized with different colors as indicated in the image legend. In addition, for each scene, we showcase the target image (bottom left) along with the corresponding Novel View Synthesis (NVS) output (bottom right) rendered from the camera pose estimated by 6DGS.
  • Figure 2: The figure illustrates the pipeline of our 6DGS methodology. The image is encoded using a visual backbone $\mathbf{(a)}$. Concurrently, rays are uniformly projected from the center of the 3DGS ellipsoids $\mathbf{(b)}$, and their corresponding color is estimated. Subsequently, an attention map mechanism is employed to compare the encoded ray and image features $\mathbf{(c)}$. Following this comparison, the $N_{top}$ matches are selected via attenuation, and the camera location is estimated $\mathbf{(d)}$ as the solution of a weighted Least Squares problem, resulting in a distinct 6DoF pose for the image.
  • Figure 2: Additional scenes from the Mip-NeRF 360° dataset. For each scene, we show a visualization of the camera poses with respect to the model (top) for 6DGS as well as the baselines, which are visualized with different colors as indicated in the image legend. In addition, for each scene, we showcase the target image (bottom left) along with the corresponding 3DGS Novel View Synthesis (NVS) output (bottom right) rendered from the camera pose estimated by 6DGS.
  • Figure 3: The illustration depicts the three primary stages involved in the radiant Ellicell generation. Firstly, (a) delineates the formulation of components required to compute the geometric information for each cell. Secondly, (b) shows the resulting Ellicell grid positioned on the surface of the ellipsoid along with their respective center points. Finally, (c) demonstrates the generation of rays originating from the center point of the ellipsoid going through the Ellicell center.
  • ...and 1 more figure
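
The Figure 3 caption describes rays that originate at an ellipsoid's center and pass through the center points of cells laid out on its surface. The sketch below illustrates only that last step under loud assumptions: it does not reproduce the Ellicell grid. Instead it substitutes a Fibonacci lattice on the unit sphere, stretched by the ellipsoid's axis lengths, as a stand-in for the cell centers (this stretching is not equal-area on the ellipsoid). The names `fibonacci_sphere` and `ellipsoid_rays`, and the parameterization by `center`, `scales`, and `rotation`, are hypothetical.

```python
import numpy as np

def fibonacci_sphere(n):
    """Return n roughly evenly spread unit vectors (Fibonacci lattice)."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n                     # heights strictly inside (-1, 1)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i    # golden-angle increments
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def ellipsoid_rays(center, scales, rotation, n_rays):
    """Rays from an ellipsoid center through sampled surface points.

    center:   (3,) ellipsoid center in world coordinates
    scales:   (3,) semi-axis lengths
    rotation: (3, 3) rotation matrix mapping local axes to world axes
    Returns (origins, unit directions, surface points), each of shape (n_rays, 3).
    """
    center = np.asarray(center, dtype=float)
    scales = np.asarray(scales, dtype=float)
    rotation = np.asarray(rotation, dtype=float)

    # Stand-in for cell centers: stretch sphere samples onto the ellipsoid.
    local = fibonacci_sphere(n_rays) * scales
    surface = center + local @ rotation.T

    # Each ray starts at the center and passes through one surface point.
    directions = surface - center
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    origins = np.broadcast_to(center, directions.shape).copy()
    return origins, directions, surface

# Usage: 64 rays for one ellipsoid rotated about the z axis.
theta = 0.3
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
origins, directions, surface = ellipsoid_rays(
    center=[1.0, 2.0, 3.0], scales=[0.5, 1.0, 2.0], rotation=rot_z, n_rays=64)
```

In a full pipeline along the lines of Figure 2, rays like these would be generated for every ellipsoid, scored against image features, and the top-scoring ones passed to a pose solver; those stages are outside the scope of this sketch.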