Triplane Grasping: Efficient 6-DoF Grasping with Single RGB Images
Yiming Li, Hanchi Ren, Yue Yang, Jingjing Deng, Xianghua Xie
TL;DR
This work tackles real-time 6-DoF grasping from a single RGB image with Triplane Grasping, a two-stage pipeline that first reconstructs a usable 3D representation and then predicts grasp poses anchored to that reconstruction. The 3D stage combines an explicit point cloud with an implicit triplane feature field into a hybrid Triplane-Gaussian representation, which enables fast, differentiable rendering via Gaussian Splatting. Grasp prediction runs on the reconstructed geometry using Contact-GraspNet, which reduces the 6-DoF problem in $SE(3)$ to a 4-DoF formulation by anchoring each grasp at a contact point in the point cloud; a contact-filtering step then ensures that grasps target the intended object. Experiments on OmniObject3D and GraspNet-1Billion show that the method balances speed and accuracy well for single-RGB input, generalizes strongly to unseen objects, and has the fastest inference among the reported baselines. The approach offers a practical path toward robust, real-time robotic grasping in real-world settings, with potential extensions to cluttered scenes and varying object scales.
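To make the idea of an implicit triplane feature field concrete, the following is a minimal sketch of how a 3D point can be given a feature vector by projecting it onto three axis-aligned feature planes and interpolating. The function names, the `planes` dictionary layout, the normalized coordinate range $[-1, 1]$, and the choice to concatenate the three plane features are all assumptions for illustration; the summary above does not specify how the paper aggregates them.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, R, R) feature plane at coords u, v in [-1, 1].

    The plane is indexed as plane[channel, row (v axis), column (u axis)].
    """
    _, res, _ = plane.shape
    # Map normalized coordinates to continuous pixel coordinates in [0, res - 1].
    x = (u + 1.0) * 0.5 * (res - 1)
    y = (v + 1.0) * 0.5 * (res - 1)
    # Index of the top-left texel of the 2x2 interpolation neighbourhood.
    x0 = int(np.clip(np.floor(x), 0, res - 2))
    y0 = int(np.clip(np.floor(y), 0, res - 2))
    tx, ty = x - x0, y - y0
    top = (1 - tx) * plane[:, y0, x0] + tx * plane[:, y0, x0 + 1]
    bottom = (1 - tx) * plane[:, y0 + 1, x0] + tx * plane[:, y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bottom

def query_triplane(planes, point):
    """Return a feature vector for one 3D point from three feature planes.

    `planes` is a hypothetical dict with keys "xy", "xz", "yz", each (C, R, R).
    Concatenation is an assumed aggregation; summation is another common choice.
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return np.concatenate([f_xy, f_xz, f_yz])
```

In a hybrid representation of this kind, each point of the explicit point cloud would be queried in this way, and the resulting feature would be decoded into per-point attributes such as Gaussian parameters for splatting.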
Abstract
Reliable object grasping is one of the fundamental tasks in robotics. However, determining a grasp pose from a single image has long been a challenge due to limited visual information and the complexity of real-world objects. In this paper, we propose Triplane Grasping, a fast grasping decision-making method that relies solely on a single RGB image as input. Triplane Grasping creates a hybrid Triplane-Gaussian 3D representation through a point decoder and a triplane decoder, which together produce an efficient, high-quality reconstruction of the target object that meets real-time grasping requirements. We propose an end-to-end network that generates 6-DoF parallel-jaw grasp distributions directly from 3D points in the point cloud, treating them as potential grasp contacts and thereby anchoring each grasp pose in the observed data. Experiments on the OmniObject3D and GraspNet-1Billion datasets demonstrate that our method achieves rapid modeling and grasp pose decision-making for everyday objects, together with strong generalization capability.
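As an illustration of what anchoring a grasp at a point-cloud contact means, the sketch below assembles a full 6-DoF parallel-jaw grasp pose from a contact point and a small set of predicted quantities, in the spirit of contact-based grasp representations such as Contact-GraspNet. The function name, the argument names, the default `depth` value, and the sign convention for the approach direction are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def grasp_pose_from_contact(contact, baseline, approach, width, depth=0.10):
    """Build a 4x4 gripper pose from a contact point and predicted grasp terms.

    contact  : (3,) point on the object surface, taken from the point cloud.
    baseline : (3,) direction from this contact toward the opposite finger.
    approach : (3,) direction in which the gripper approaches the object.
    width    : predicted opening between the two fingers (metres).
    depth    : assumed fixed distance from gripper base to fingertip line.
    """
    contact = np.asarray(contact, dtype=float)
    b = np.asarray(baseline, dtype=float)
    b = b / np.linalg.norm(b)
    # Gram-Schmidt: make the approach direction orthogonal to the baseline.
    a = np.asarray(approach, dtype=float)
    a = a - np.dot(a, b) * b
    a = a / np.linalg.norm(a)
    # Columns are the gripper's x, y, z axes; this ordering is right-handed.
    rotation = np.stack([b, np.cross(a, b), a], axis=1)
    # Move to the midpoint between the fingers, then back off along the
    # approach axis to reach the gripper base (assumed sign convention).
    translation = contact + 0.5 * width * b - depth * a
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose
```

Because the contact point is read directly from the reconstructed point cloud, the network only has to predict the remaining rotation and width terms, which is what makes the formulation lower-dimensional than regressing an unconstrained pose in $SE(3)$.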
