6-DoF Grasp Pose Evaluation and Optimization via Transfer Learning from NeRFs

Gergely Sóti, Xi Huang, Christian Wurll, Björn Hein

TL;DR

This work introduces an implicit grasping framework that uses a pretrained MVNeRF scene representation to evaluate 6-DoF grasp candidates with a learned scoring function trained from a few demonstrations. Grasp poses are optimized by gradient ascent on the evaluation score, which lets the model generalize from simulated 4-DoF top-down grasps to 6-DoF grasps in both cluttered simulations and real-world environments without additional data. The MVNeRF backbone transfers visual and geometric scene priors. The approach is evaluated in simple, cluttered, and novel-object simulation settings as well as in real-world experiments, showing robust sim-to-real transfer and identifying calibration sensitivity as a key real-world challenge. The work demonstrates the viability of NeRF-based implicit representations for real-time grasp planning, achieving competitive performance with limited training data and suggesting avenues for enhanced task grounding and planning. Overall, it advances data-efficient, geometry-aware grasping by combining NeRF scene representations with implicit grasp evaluation and gradient-based optimization.

Abstract

We address the problem of robotic grasping of known and unknown objects using implicit behavior cloning. We train a grasp evaluation model from a small number of demonstrations that outputs higher values for grasp candidates that are more likely to succeed in grasping. This evaluation model serves as an objective function that we maximize to identify successful grasps. Key to our approach is the utilization of learned implicit representations of visual and geometric features derived from a pre-trained NeRF. Though trained exclusively in a simulated environment with simplified objects and 4-DoF top-down grasps, our evaluation model and optimization procedure demonstrate generalization to 6-DoF grasps and novel objects both in simulation and in real-world settings, without the need for additional data. Supplementary material is available at: https://gergely-soti.github.io/grasp
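
The idea of treating an evaluation model as an objective and maximizing it over grasp poses can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_score` is a hand-written quadratic standing in for the learned evaluator, `TARGET` is an arbitrary pose, the pose is parameterized as a flat 6-vector, and gradients are taken by finite differences rather than by backpropagation through a network. It is not the authors' implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical 6-DoF parameterization: [tx, ty, tz, rx, ry, rz].
Pose = List[float]

# Arbitrary "best" grasp, used only to construct the toy score below.
TARGET: Pose = [0.10, -0.05, 0.02, 0.0, 0.3, 1.2]


def toy_score(pose: Sequence[float]) -> float:
    """Stand-in for a learned grasp evaluator: higher is better, peak at TARGET."""
    return -sum((p - t) ** 2 for p, t in zip(pose, TARGET))


def numerical_gradient(
    f: Callable[[Sequence[float]], float], pose: Sequence[float], eps: float = 1e-5
) -> List[float]:
    """Central-difference gradient of f at pose."""
    grad = []
    for i in range(len(pose)):
        hi = list(pose)
        lo = list(pose)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def optimize_grasp(
    f: Callable[[Sequence[float]], float],
    init: Sequence[float],
    step: float = 0.1,
    iters: int = 200,
) -> Pose:
    """Plain gradient ascent: repeatedly move the pose uphill on the score."""
    pose = list(init)
    for _ in range(iters):
        grad = numerical_gradient(f, pose)
        pose = [p + step * g for p, g in zip(pose, grad)]
    return pose


if __name__ == "__main__":
    start = [0.0, 0.0, 0.2, 0.0, 0.0, 0.0]  # e.g. a top-down initial guess
    best = optimize_grasp(toy_score, start)
    print("score before:", toy_score(start))
    print("score after: ", toy_score(best))
```

With a learned evaluator the score surface is generally non-convex, so the result depends on the starting pose; optimizing from several initial candidates and keeping the highest-scoring one is a common way to handle that.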

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Two grasp candidates after optimization for grasping a boot in the real world. The left images display the model's input, the middle images depict how the model 'imagines' the scene from a different perspective, and the right image shows the executed grasp, indicated by the green marker. Note that the markers depicting the gripper pose are overlays and do not get occluded by objects in the scene. The model was only trained in simulation using monochromatic prismatic objects and top-down grasps; it has never seen boots, real-world images, or non-top-down grasps.
  • Figure 2: Left: generating training samples from a demonstration; middle and right: training method - 5-DoF poses are computed from the 6-DoF grasp candidate and are evaluated and fused in the grasp evaluation model, which incorporates the pre-trained MVNeRF.
  • Figure 3: MVNeRF renderings for the simulated tasks. Left: input images with known camera parameters; middle: renderings of the 1-view and 3-view MVNeRF; right: ground truth generated in simulation. Note that depth information was used neither in training nor in inference, and is displayed only for visualization.
  • Figure 4: Depiction of grasp value estimations in 6-DoF space. Left: a valid grasp pose and the corners of the tx-ty slice (translational displacement along x and y axes); right: visualization of negative grasp errors and estimated grasp values across multiple slices in translational and rotational dimensions. Interpretation of the slices: in the tx-ty slice of the negated grasp error function, regions that remain red indicate that deviations along the y-axis still result in valid grasps. Bright white spots, corresponding to the arms of the T-shaped object, indicate minimal translational error, yet are not red because the rotational error there is maximal.
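
To make the notion of a tx-ty slice in Figure 4 concrete, the following sketch samples a negated error on a grid of translational offsets around a valid grasp. The error function is purely hypothetical (the source does not define it here): it penalizes displacement along x and tolerates displacement along y up to an assumed `y_tolerance`, which mimics the described behaviour that deviations along the y-axis can still yield valid grasps. The grid size and offsets are likewise illustrative.

```python
from typing import List


def negated_error(dx: float, dy: float, y_tolerance: float = 0.03) -> float:
    """Hypothetical negated grasp error for an offset (dx, dy) from a valid grasp.

    Offsets along y within y_tolerance cost nothing (the grasp stays valid);
    offsets along x are always penalized.
    """
    return -(abs(dx) + max(0.0, abs(dy) - y_tolerance))


def tx_ty_slice(half_width: float = 0.05, n: int = 5) -> List[List[float]]:
    """Sample negated_error on an n x n grid; rows index dx, columns index dy."""
    offsets = [-half_width + 2 * half_width * i / (n - 1) for i in range(n)]
    return [[negated_error(dx, dy) for dy in offsets] for dx in offsets]


if __name__ == "__main__":
    for row in tx_ty_slice():
        print(" ".join(f"{value:7.3f}" for value in row))
```

Slices over other pairs of dimensions (for example one translational and one rotational axis) follow the same pattern: fix the remaining pose components at the valid grasp and vary only the two of interest.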