GSGTrack: Gaussian Splatting-Guided Object Pose Tracking from RGB Videos

Zhiyuan Chen, Fan Lu, Guo Yu, Bin Li, Sanqing Qu, Yuan Huang, Changhong Fu, Guang Chen

TL;DR

This work addresses RGB-only 6DoF object pose tracking without reliable depth information by proposing GSGTrack, which jointly optimizes pose and geometry using an online 3D Gaussian Splatting representation and a graph-based geometric optimization. It introduces a differential silhouette loss and a confidence-aware image-pair pruning strategy to mitigate noise in geometry, and employs an online 3DGS that incrementally reconstructs the object while guiding pose updates. The optimization is organized within a dynamic geometric structure graph, enabling robust fusion of historical views and depth confidences. Experimental results on HO3D and OnePose demonstrate improved pose accuracy and reconstruction quality over strong baselines, highlighting the method’s robustness to depth noise, occlusions, and low-texture regions, with practical implications for monocular robotic manipulation.

Abstract

Tracking the 6DoF pose of unknown objects in monocular RGB video sequences is crucial for robotic manipulation. However, existing approaches typically rely on accurate depth information, which is non-trivial to obtain in real-world scenarios. Although depth estimation algorithms can be employed, geometric inaccuracy can lead to failures in RGBD-based pose tracking methods. To address this challenge, we introduce GSGTrack, a novel RGB-based pose tracking framework that jointly optimizes geometry and pose. Specifically, we adopt 3D Gaussian Splatting to create an optimizable 3D representation, which is learned simultaneously with a graph-based geometry optimization to capture the object's appearance features and refine its geometry. However, the joint optimization process is susceptible to perturbations from noisy pose and geometry data. Thus, we propose an object silhouette loss to address the issue of pixel-wise loss being overly sensitive to pose noise during tracking. To mitigate the geometric ambiguities caused by inaccurate depth information, we propose a geometry-consistent image pair selection strategy, which filters out low-confidence pairs and ensures robust geometric optimization. Extensive experiments on the OnePose and HO3D datasets demonstrate the effectiveness of GSGTrack in both 6DoF pose tracking and object reconstruction.
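The abstract motivates an object silhouette loss by noting that a pixel-wise loss is overly sensitive to pose noise. The exact formulation is not given on this page, so the following is only a minimal sketch under stated assumptions: it uses a soft-IoU loss between a rendered opacity map and an object mask, which is one common way to compare silhouettes. The function names `silhouette_loss` and `shifted_square`, and the toy square "object", are hypothetical and not taken from the paper.

```python
import numpy as np


def silhouette_loss(rendered_alpha, object_mask, eps=1e-8):
    """Soft-IoU loss between a rendered opacity map and an object mask.

    Both inputs are HxW arrays with values in [0, 1]. This is an assumed,
    generic silhouette loss, not necessarily the one used by GSGTrack.
    """
    a = np.clip(np.asarray(rendered_alpha, dtype=np.float64), 0.0, 1.0)
    m = np.clip(np.asarray(object_mask, dtype=np.float64), 0.0, 1.0)
    intersection = (a * m).sum()
    union = (a + m - a * m).sum()
    # 0 when the silhouettes coincide, 1 when they do not overlap at all.
    return 1.0 - intersection / (union + eps)


def shifted_square(shift, size=32, side=12):
    """Toy silhouette: a filled square moved `shift` pixels to the right."""
    img = np.zeros((size, size))
    top = (size - side) // 2
    left = top + shift
    img[top:top + side, left:left + side] = 1.0
    return img
```

In this toy setup, comparing `shifted_square(0)` against copies shifted by a growing number of pixels gives a loss that increases smoothly with the offset, because it depends only on how much the two shapes overlap and not on per-pixel appearance. That is the intuition for why a silhouette term can tolerate small pose errors better than a photometric term.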

Paper Structure

This paper contains 24 sections, 14 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: We tackle a challenging problem: tracking the 6DoF pose of unknown objects from RGB videos without accurate depth information. When applied to RGB videos with inaccurate estimated depth [Yang2024DepthAnything], RGBD-based methods [Wen2023BundleSDF] degenerate quickly. In contrast, our method achieves robust tracking and reconstruction results.
  • Figure 2: Overview of our proposed GSGTrack. To achieve accurate 6DoF object pose tracking without relying on precise depth information, we propose a joint optimization framework. Starting with a video sequence, we preprocess consecutive frames by generating object masks and estimating coarse geometry. Next, we introduce an online 3DGS representation that facilitates continuous object reconstruction from incoming video frames. Building on this 3D representation, we design a graph-based geometric optimization framework that refines both object pose and 3D structure through an online geometric structure graph. Additionally, we introduce an image pair pruning strategy and a confidence-aware geometric optimization technique to enhance the robustness and accuracy of the optimization process.
  • Figure 3: Qualitative comparison of GSGTrack and the baseline on HO3D. Left: 6DoF pose tracking, with green and yellow boxes showing ground-truth and estimated poses, respectively. Right: front and back views of the reconstruction results, highlighting the object's geometric structure. Due to hand occlusions, black hand-shaped artifacts appear, obscuring parts of the object. Our reconstruction corrects the color divergence between the ground truth and the actual object colors seen in the video.
  • Figure 4: We visualize the Relative Rotation Error (RRE) of different settings of our method.
  • Figure 5: Impact of our Gaussian pruning strategy on reconstruction quality. Our strategy significantly enhances geometric accuracy and effectively eliminates floaters.
  • ...and 6 more figures
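
The Figure 2 caption and the abstract both mention an image pair pruning strategy that filters out low-confidence pairs before geometric optimization. How the confidence is computed is not stated on this page, so the sketch below makes two loud assumptions: each candidate pair comes with a per-pixel confidence map, and a pair is kept when the mean of that map reaches a fixed threshold. The function `prune_image_pairs` and the parameter `min_confidence` are hypothetical names used only for illustration.

```python
import numpy as np


def prune_image_pairs(pairs, confidences, min_confidence=0.5):
    """Keep the image pairs whose mean confidence reaches the threshold.

    pairs:        list of (frame_i, frame_j) index tuples.
    confidences:  list of HxW arrays in [0, 1], one per pair (assumed input).
    Returns a list of ((frame_i, frame_j), score) for the retained pairs,
    in their original order.
    """
    kept = []
    for pair, conf in zip(pairs, confidences):
        score = float(np.mean(conf))
        if score >= min_confidence:
            kept.append((pair, score))
    return kept
```

A caller would pass the candidate edges of the geometric structure graph together with their confidence maps and use only the returned pairs in the optimization. The mean is the simplest possible aggregate; a median or a masked average over object pixels would fit the same interface.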