Neural Attention Field: Emerging Point Relevance in 3D Scenes for One-Shot Dexterous Grasping

Qianxu Wang, Congyue Deng, Tyler Ga Wei Lum, Yuanpei Chen, Yaodong Yang, Jeannette Bohg, Yixin Zhu, Leonidas Guibas

TL;DR

The neural attention field is proposed for representing semantic-aware dense feature fields in the 3D space by modeling inter-point relevance instead of individual point features, and is applied to novel scenes for semantics-aware dexterous grasping from one-shot demonstration.

Abstract

One-shot transfer of dexterous grasps to novel scenes with object and context variations has been a challenging problem. While distilled feature fields from large vision models have enabled semantic correspondences across 3D scenes, their features are point-based and restricted to object surfaces, limiting their ability to model complex semantic feature distributions for hand-object interactions. In this work, we propose the neural attention field for representing semantic-aware dense feature fields in the 3D space by modeling inter-point relevance instead of individual point features. Core to it is a transformer decoder that computes the cross-attention between any 3D query point and all the scene points, and provides the query point feature with an attention-based aggregation. We further propose a self-supervised framework for training the transformer decoder from only a few 3D point clouds without hand demonstrations. Post-training, the attention field can be applied to novel scenes for semantics-aware dexterous grasping from one-shot demonstration. Experiments show that our method provides better optimization landscapes by encouraging the end-effector to focus on task-relevant scene regions, resulting in significant improvements in success rates on real robots compared with the feature-field-based methods.
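The attention-based aggregation described in the abstract can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is not the authors' architecture: the projection matrices `W_q`, `W_k`, `W_v`, the choice of raw coordinates as query input, and all dimensions are assumptions made for illustration only. It shows the core idea that a query point anywhere in 3D space receives a feature as an attention-weighted sum over all scene points.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_field_features(query_xyz, scene_xyz, scene_feat, W_q, W_k, W_v):
    """Single-head cross-attention from 3D query points to scene points.

    query_xyz:  (Q, 3) arbitrary 3D query positions (not restricted to surfaces)
    scene_xyz:  (N, 3) scene point positions
    scene_feat: (N, C) per-point features, e.g. distilled from a vision model
    W_q, W_k, W_v: hypothetical projection matrices (assumed, not from the paper)
    """
    q = query_xyz @ W_q                                          # (Q, D)
    k = np.concatenate([scene_xyz, scene_feat], axis=1) @ W_k    # (N, D)
    v = scene_feat @ W_v                                         # (N, D_out)
    logits = q @ k.T / np.sqrt(q.shape[-1])                      # (Q, N)
    attn = softmax(logits, axis=-1)     # relevance of each scene point per query
    return attn @ v, attn               # aggregated features, attention weights

# Toy usage with random data (all sizes are illustrative).
rng = np.random.default_rng(0)
Q, N, C, D, D_out = 5, 100, 16, 32, 8
query_xyz = rng.normal(size=(Q, 3))
scene_xyz = rng.normal(size=(N, 3))
scene_feat = rng.normal(size=(N, C))
W_q = rng.normal(size=(3, D))
W_k = rng.normal(size=(3 + C, D))
W_v = rng.normal(size=(C, D_out))

feats, attn = attention_field_features(query_xyz, scene_xyz, scene_feat, W_q, W_k, W_v)
```

Each row of `attn` is a distribution over scene points, so the query feature depends on which scene regions are relevant rather than only on which are spatially closest.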

Paper Structure

This paper contains 34 sections, 9 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Given a one-shot demonstration of a dexterous grasp, we want to generalize to novel scene variations with relevant semantics. To better model the complex semantic feature distributions in hand-object interactions, we propose the neural attention field, which represents a semantic-aware dense feature field by modeling inter-point relevance instead of individual point features. It encourages the end-effector to focus on scene regions with higher task relevance instead of spatial proximity, resulting in robust and semantic-aware transfer of dexterous grasps across scenes.
  • Figure 2: Self-supervised training for the transformer decoder. Left: Given a few scenes, we first select the corresponding keypoints by computing cyclic mutual nearest neighbors (MNN) based on their feature similarities. Right: We take the selected keypoints as queries for each scene and enforce the features before and after the $D_\theta$-aggregation to induce the same keypoint correspondences across scenes. Specifically, we apply an InfoNCE loss to preserve the orders of the keypoints given the permutation equivariance of transformers.
  • Figure 3: End-effector optimization in the neural attention field. Left: We sample query points on both the demonstration hand and the target hand to be optimized and obtain their features through the transformer decoder $D_\theta$. Middle: The feature differences induce an energy field. Yellow indicates lower energy values and thus higher feature similarities. Right: Minimizing the energy function w.r.t. the hand parameters gives the final grasping pose. Both hand positions and joint parameters are optimized. The optimization trajectory is shown in green and a few hand poses sampled along the trajectory are shown in blue (optimization steps indicated with colors from light to dark).
  • Figure 4: Visualizations of the energy fields and end-effector optimization. In each group from left to right are: the source scene with demonstration (hand shown in blue); the target scene and our results (optimization trajectory shown in green and final resulting hand shown in blue); the feature-induced energy fields for our method and SparseDFF [wang2023sparsedff]. We only visualize the 2D sections of the 3D energy fields; yellow indicates lower energy values and thus higher feature similarities. Our method shows more concentrated low-energy regions around the target grasping positions.
  • Figure 5: Real-robot results. From left to right, we show our results of grasping objects in scene contexts with distraction, grasp transfer between objects with similar semantics but shape variations, and functional grasping of tools.
  • ...and 4 more figures
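
The keypoint selection and contrastive objective mentioned in the Figure 2 caption can be sketched as follows. This is a simplified two-scene illustration, not the paper's training code: the caption describes cyclic mutual nearest neighbors across several scenes, whereas the sketch below checks mutual agreement between just one pair, and the function names, cosine similarity, and `temperature` value are assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a (Na, C) and b (Nb, C).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mutual_nearest_neighbors(feat_a, feat_b):
    """Return index pairs (i, j) where i's best match is j and j's best match is i."""
    sim = cosine_sim(feat_a, feat_b)
    a_to_b = sim.argmax(axis=1)              # best match in B for each point in A
    b_to_a = sim.argmax(axis=0)              # best match in A for each point in B
    idx_a = np.arange(feat_a.shape[0])
    keep = b_to_a[a_to_b] == idx_a           # cycle A -> B -> A returns to start
    return np.stack([idx_a[keep], a_to_b[keep]], axis=1)

def info_nce(feat_a, feat_b, temperature=0.07):
    """InfoNCE loss where feat_a[i] and feat_b[i] form the positive pair
    and every other row of feat_b is a negative for feat_a[i]."""
    logits = cosine_sim(feat_a, feat_b) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy usage: scene B holds the same four features as scene A in a shuffled order.
feat_a = np.eye(4)
feat_b = feat_a[[2, 0, 3, 1]]
pairs = mutual_nearest_neighbors(feat_a, feat_b)

# Reordering B by the matched indices aligns the keypoints, giving a low loss.
loss_aligned = info_nce(feat_a[pairs[:, 0]], feat_b[pairs[:, 1]])
loss_shuffled = info_nce(feat_a, feat_b)
```

Because the loss ties the i-th query in one scene to the i-th query in the other, it penalizes any reordering of the keypoints, which is one way to read the caption's remark about preserving keypoint order under the permutation equivariance of transformers.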