ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model

Boshu Lei, Wen Jiang, Kostas Daniilidis

TL;DR

ActiveGrasp tackles robotic grasping in clutter by formulating next-best-view (NBV) selection as maximizing information gain derived from the grasp pose distribution on the $SE(3)$ manifold. It introduces a calibrated energy-based model (EBM) that captures the multi-modal grasp distribution and aligns the energy with actual grasp success through a learnable temperature and tailored losses, enabling reliable entropy-based planning. The approach combines Gaussian Posterior approximations (GAP), denoising score matching on $SE(3)$, and a calibrated grasp generation pipeline to estimate information gain and guide view selection under a limited view budget. Experiments in simulation and on a real robot show higher grasp success rates and lower calibration error than state-of-the-art baselines, and the simulated environment is offered as a reproducible benchmark. The work provides a principled framework for active perception in manipulation; the authors state that the source code will be made public when the paper is released.

Abstract

Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on a projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from the grasp distribution. Our energy-based model captures the multimodal nature of the grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution. The next best view is selected by estimating the information gain for grasping from the calibrated distribution conditioned on the reconstructed environment, which can efficiently drive the robot to explore graspable parts of the target object. Experiments in simulated environments and on real robot setups demonstrate that our model grasps objects in a cluttered environment more successfully under a limited view budget than previous state-of-the-art models. Our simulated environment can serve as a reproducible platform for future research on active grasping. The source code will be made public when the paper is released.
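To make the calibration idea concrete, the following is a minimal sketch of how an energy could be mapped to a predicted success probability with a temperature, and how calibration error could then be measured. The sigmoid mapping, the function names, and the binned expected calibration error (ECE) are assumptions chosen for illustration; the abstract does not specify the paper's actual losses or metric.

```python
import math

def success_probability(energy, temperature=1.0):
    """Hypothetical mapping (an assumption, not the paper's formula):
    lower energy means higher predicted grasp success, via sigmoid(-E / T)."""
    z = -energy / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative z
    return e / (1.0 + e)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Standard binned ECE: size-weighted mean of |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        confidence = sum(p for p, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - confidence)
    return ece

# Toy data (made up): energies of five grasps and whether each succeeded.
energies = [-2.0, -1.0, 0.5, 1.0, 2.0]
outcomes = [1, 1, 0, 0, 0]
for t in (0.25, 1.0, 4.0):
    probs = [success_probability(e, t) for e in energies]
    print(t, round(expected_calibration_error(probs, outcomes), 3))
```

In a learned model the temperature would be a trainable parameter rather than a value swept by hand; the loop above only shows that the same energies can be better or worse calibrated depending on it.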

Paper Structure

This paper contains 24 sections, 24 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: ActiveGrasp Overview. The ActiveGrasp framework actively selects the next view with the highest information gain for the grasp task. Previous methods estimate the information gain using visibility or the affordance map, focusing on regions that are rich in visual features but infeasible for grasping. In contrast, our method estimates information gain as the entropy reduction of the grasp pose distribution, selecting a view that observes highly uncertain grasps.
  • Figure 2: Our Active Grasping Pipeline. Our method reconstructs a 3D scene $w$ from a set of initial views. The energy-based model estimates the grasp entropy $\eta(w)$ using $w$ and the sampled grasp poses $g$. We then compute the information gain $\mathbf{I}$ of candidate views from $\nabla_w^2\eta(w)$, select the next best view with the highest information gain, and repeat this procedure until the view budget is met. Afterwards, we generate grasp poses using the refined scene representation. A planner generates paths to the grasps and executes them on a robot arm.
  • Figure 3: Illustration of entropy on grasp poses. An example of the different definitions of entropy before the next observation (top) and after the next observation (bottom). We plot the success rate (blue) and entropy (dashed yellow) of a single grasp moving along the cup. $\tilde{\mathbf{H}}[g|w]$ is the Shannon entropy of the grasp distribution and $\mathbf{H}[g|w]$ is the grasp entropy defined in our paper.
  • Figure 4: Visualization of View Selections in the Real-world Experiment. We showcase the top 3 view selections of different view selection methods in the same setup. Our method focuses more on perception for regions.
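
The loop described in the Figure 2 caption (reconstruct, score candidate views, pick the best, repeat until the budget is met) can be sketched as a greedy selection. This is only an illustration under stated assumptions: `select_views`, `reconstruct`, and `information_gain` are hypothetical names, the budget is taken to mean the number of additional views, and the toy gain below (angular distance to the nearest observed view) is a stand-in, not the paper's entropy-based information gain.

```python
def select_views(initial_views, candidates, budget, reconstruct, information_gain):
    """Greedy next-best-view loop: add `budget` views, one at a time.

    `reconstruct(views)` builds a scene representation from the chosen views;
    `information_gain(scene, view)` scores a candidate view. Both are
    caller-supplied placeholders for the real components.
    """
    chosen = list(initial_views)
    remaining = list(candidates)
    scene = reconstruct(chosen)
    added = 0
    while added < budget and remaining:
        # Pick the candidate with the highest gain under the current scene.
        best = max(remaining, key=lambda v: information_gain(scene, v))
        remaining.remove(best)
        chosen.append(best)
        scene = reconstruct(chosen)  # refine the scene with the new view
        added += 1
    return chosen, scene

# Toy stand-ins: a view is an azimuth in degrees, a scene is the tuple of
# observed azimuths, and the gain is the distance to the nearest observed view.
def toy_reconstruct(views):
    return tuple(views)

def toy_gain(scene, view):
    def angular_distance(a, b):
        d = abs(a - b) % 360
        return min(d, 360 - d)
    return min(angular_distance(view, s) for s in scene)

views, scene = select_views([0], [10, 90, 180, 270], 2, toy_reconstruct, toy_gain)
print(views)  # [0, 180, 90]
```

After the loop, the final scene would be passed to grasp generation and motion planning, which are outside the scope of this sketch.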