GS-Pose: Generalizable Segmentation-based 6D Object Pose Estimation with 3D Gaussian Splatting

Dingding Cai, Janne Heikkilä, Esa Rahtu

TL;DR

GS-Pose tackles generalizable 6D object pose estimation from RGB images for novel objects. It builds a reference database of three object representations offline, then runs a cascaded inference pipeline online: object detection, initial pose estimation via rotation-aware template retrieval, and refinement with GS-Refiner, a differentiable render-and-compare module. The core contributions are (i) three object representations: a semantic representation, rotation-aware embeddings, and a 3D Gaussian Object representation; (ii) a segmentation-based detection and rotation-aware matching pipeline; and (iii) a fast, differentiable 3D Gaussian splatting renderer that enables iterative pose refinement. The approach achieves state-of-the-art results on LINEMOD and OnePose-LowTexture, with strong performance on textureless and symmetric objects, and new objects can be captured with commodity hardware. This work advances RGB-only, model-free pose estimation by combining multiple specialized representations with differentiable rendering-based refinement, which is relevant to robotics and AR applications that need to add new objects quickly.

Abstract

This paper introduces GS-Pose, a unified framework for localizing and estimating the 6D pose of novel objects. GS-Pose begins with a set of posed RGB images of a previously unseen object and builds three distinct representations stored in a database. At inference, GS-Pose operates sequentially by locating the object in the input image, estimating its initial 6D pose using a retrieval approach, and refining the pose with a render-and-compare method. The key insight is the application of the appropriate object representation at each stage of the process. In particular, for the refinement step, we leverage 3D Gaussian splatting, a novel differentiable rendering technique that offers high rendering speed and relatively low optimization time. Off-the-shelf toolchains and commodity hardware, such as mobile phones, can be used to capture new objects to be added to the database. Extensive evaluations on the LINEMOD and OnePose-LowTexture datasets demonstrate excellent performance, establishing the new state-of-the-art. Project page: https://dingdingcai.github.io/gs-pose.
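
The retrieval step in the abstract can be illustrated with a minimal sketch. It assumes that each reference image has an embedding vector and that the best-matching reference is the one with the highest cosine similarity to the query embedding. The similarity measure and the function names `cosine_similarity` and `retrieve_best_template` are illustrative assumptions, not the authors' implementation; GS-Pose additionally uses a pose decoder, which is not modeled here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_best_template(query_embedding, reference_embeddings):
    """Return (index, score) of the reference embedding closest to the query.

    Hypothetical helper: the index would be used to look up the known pose
    of the matching reference image as an initial pose hypothesis.
    """
    scores = [cosine_similarity(query_embedding, ref) for ref in reference_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Toy reference database with three embeddings (made-up values).
reference_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
query_embedding = [0.5, 0.9, 0.0]
best_index, best_score = retrieve_best_template(query_embedding, reference_embeddings)
```

In this toy example the third reference is selected because it points in nearly the same direction as the query; in practice the retrieved pose would only serve as a starting point for refinement.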

Paper Structure

This paper contains 36 sections, 11 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of GS-Pose. GS-Pose involves two distinct phases to achieve pose estimation for a novel object, i.e., reference database creation and object pose inference. The first phase operates offline and occurs only once per object to construct multiple representations of the object. These representations include an object semantic representation ($\mathcal{F}^{obj}$), a set of rotation-aware embedding vectors ($\{V_i^{obj}\}^{N_r}_{i=1}$), and a 3D Gaussian Object ($\mathcal{G}^{obj}$). During inference, GS-Pose first employs an object detector to detect the object in a query image using the semantic information $\mathcal{F}^{obj}$. Then, GS-Pose adopts a pose estimator to produce an initial pose (blue box) from the detection result with the rotation-aware embeddings $\{V_i^{obj}\}^{N_r}_{i=1}$. Finally, GS-Pose leverages a pose refinement module (GS-Refiner) with $\mathcal{G}^{obj}$ to obtain a refined pose (green box). We indicate the ground-truth pose in red.
  • Figure 2: Overview of the reference database creation process. We begin by selecting a group of keyframes from reference images. (1). These keyframes are processed through DINOv2 and Co-Segmenter to jointly predict object segmentation masks, which are then utilized to extract the object semantic tokens ($\mathcal{F}^{obj}$) from the keyframe features. (2). Image-wise object segmentation is performed for all reference images $\{I^{ref}_i\}^{N_r}_{i=1}$ using an Obj-Segmenter with the obtained semantic information $\mathcal{F}^{obj}$. We then employ an RA-Encoder to extract the rotation-aware embeddings $\{V_i^{obj}\}^{N_r}_{i=1}$ from the segmented images. (3). Finally, we create a 3D Gaussian Object representation $\mathcal{G}^{obj}$ (viewed as a 3D point cloud for simplicity) using all segmented images with the known poses.
  • Figure 3: (1). Co-Segmenter includes a transformer-like module and a mask decoder to produce the co-segmentation masks. (2). Obj-Segmenter consists of the DINOv2 backbone, a transformer-like module, and a mask decoder to predict the object mask. (3). RA-Encoder contains the DINOv2 backbone, four $3\times 3$ 2D convolutional (Conv2D) layers with stride 2, a generalizable average pooling layer, and a fully connected (FC) layer.
  • Figure 4: (1). Detector first employs an Obj-Segmenter to produce a mask from the input image using the semantic information ($\mathcal{F}^{obj}$). Then, connected components are computed from the predicted mask to generate proposals, which are further processed by a proposal selector to determine the final output. (2). Pose Estimator utilizes an Obj-Segmenter to predict an object mask $M^{que}$ ($\mathcal{F}^{obj}$ is omitted for clarity). An embedding vector $V^{que}$ is then extracted from the segmented image using RA-Encoder, followed by a pose decoder for estimating an initial pose ($P_{init}$) using both $V^{que}$ and $M^{que}$. (3). GS-Refiner starts by applying an optimizable transformation $T^{j-1}_{gs}$ to the 3D coordinates of the 3D Gaussian Object (3DGO) $\mathcal{G}^{obj}$, where $j \geq 1$ is the refinement step. Then, the 3D Gaussian Splatting-based renderer (3DGS-Renderer) generates an RGB image ($I_{j}^{rend}$) using the initial pose ($P_{init}$) and the transformed 3DGO ($\hat{\mathcal{G}}^{obj}$). Finally, the gradient $\Delta{T_i}$ is used to update the transformation parameter $T^{j}_{gs}$, minimizing the difference ($\mathcal{L}_{gs}$) between the rendered and the segmented images.
  • Figure 5: Qualitative evaluation on LINEMOD. We present the intermediate segmentation mask predictions (for localization) as well as the estimated 6D poses. Blue, green, and red boxes represent initial, refined, and ground truth poses, respectively.
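
The detector described in the Figure 4 caption computes connected components from a predicted mask to generate proposals. The following is a minimal sketch of that idea under stated assumptions: the mask is a binary grid, components use 4-connectivity, and the proposal selector is replaced by a simple largest-area rule. The connectivity, the selection rule, and the names `mask_to_proposals` and `select_largest` are hypothetical and are not taken from the paper.

```python
from collections import deque


def mask_to_proposals(mask):
    """Return one proposal per 4-connected foreground component.

    Each proposal is a dict with a bounding box
    (row_min, col_min, row_max, col_max) and the component's pixel area.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    visited = [[False] * width for _ in range(height)]
    proposals = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or visited[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            visited[r][c] = True
            r_min = r_max = r
            c_min = c_max = c
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                r_min, r_max = min(r_min, y), max(r_max, y)
                c_min, c_max = min(c_min, x), max(c_max, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not visited[ny][nx]):
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            proposals.append({"box": (r_min, c_min, r_max, c_max), "area": area})
    return proposals


def select_largest(proposals):
    """Stand-in for the proposal selector: keep the largest component."""
    return max(proposals, key=lambda p: p["area"])


# Toy mask with two components: a 2x2 block and a 2-pixel vertical strip.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
proposals = mask_to_proposals(mask)
best = select_largest(proposals)
```

The largest-area rule is only a placeholder; the caption states that a dedicated proposal selector determines the final output, and its criteria are not described in this section.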