SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams

Zhuoheng Gao, Jiyao Zhang, Zhiyong Xie, Hao Dong, Zhaofei Yu, Rongmei Chen, Guozhang Chen, Tiejun Huang

TL;DR

SpikeGrasp presents a neuro-inspired framework for 6-DoF grasp pose detection from raw stereo spike streams, bypassing point-cloud reconstruction. It combines a Visual Pathway Network with a recurrent spiking neural network to iteratively refine a latent grasp-affordance state, followed by Graspable and Grasp Detection networks that output 6-DoF grasps. A large-scale synthetic spike-stream dataset supports end-to-end training and evaluation, demonstrating strong data efficiency and competitive or superior performance in cluttered and textureless scenes, with promising sim-to-real transfer. This work highlights the potential of neuromorphic, spike-based perception for fast, robust robotic manipulation in dynamic environments.

Abstract

Most robotic grasping systems rely on converting sensor data into explicit 3D point clouds, which is a computational step not found in biological intelligence. This paper explores a fundamentally different, neuro-inspired paradigm for 6-DoF grasp detection. We introduce SpikeGrasp, a framework that mimics the biological visuomotor pathway, processing raw, asynchronous events from stereo spike cameras, similarly to retinas, to directly infer grasp poses. Our model fuses these stereo spike streams and uses a recurrent spiking neural network, analogous to high-level visual processing, to iteratively refine grasp hypotheses without ever reconstructing a point cloud. To validate this approach, we built a large-scale synthetic benchmark dataset. Experiments show that SpikeGrasp surpasses traditional point-cloud-based baselines, especially in cluttered and textureless scenes, and demonstrates remarkable data efficiency. By establishing the viability of this end-to-end, neuro-inspired approach, SpikeGrasp paves the way for future systems capable of the fluid and efficient manipulation seen in nature, particularly for dynamic objects.

Paper Structure

This paper contains 40 sections, 17 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparison of two computational paths. (a, top) A conventional pipeline takes depth input and reconstructs a point cloud to detect feasible grasps. (b, bottom) The brain obtains information through the visual system and then processes the information streams through its neural architecture to determine how to grasp an object.
  • Figure 2: Overview of SpikeGrasp, the method for direct 6D grasp pose detection from raw stereo spike streams. First, left and right features are extracted from the spike streams to construct correlation features. These correlation features are fed, together with the spike features, into an iterative refinement network that predicts hidden states at 1/4 of the original resolution. The hidden states are then input into a graspable network (a multi-layer convolution and upsampling structure), which uses the raw data as supervision to predict the objectness and graspness of the grasping scene; together these form the object-scene graspable information. Finally, the hidden states and graspable information are input into the grasp detection network, which predicts grasp scores and gripper widths for each group and outputs $K'$ grasp poses.
  • Figure 3: Visualizations of the top 50 grasp poses predicted by our SpikeGrasp algorithm from 12 different scenes (four scenes per subset). A gripper-shaped geometry represents the grasp pose. The grasp pose color changes from blue to purple to indicate increasing confidence.
  • Figure 4: Data efficiency experiments on the synthetic dataset. The horizontal axis indicates the proportion of the training set used, and the vertical axis shows the AP.
  • Figure 5: Visualization of example results on the synthetic dataset. From left to right: the scene map, the TFP reconstruction of the spike stream, and the objectness and graspness maps of the scene.
  • ...and 2 more figures
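
The Figure 2 caption says that left and right features are combined into correlation features. A common way to do this in stereo networks is a dot-product correlation along the horizontal scanline, and the sketch below illustrates that idea only. It is an assumption, not the authors' implementation: the function name `correlation_volume`, the `max_disp` parameter, the nested-list feature layout `[H][W][C]`, and the normalization by channel count are all hypothetical choices made for illustration.

```python
def correlation_volume(left, right, max_disp):
    """Dot-product correlation between left and right feature maps.

    left, right: nested lists shaped [H][W][C] (hypothetical layout).
    Returns a nested list shaped [H][W][max_disp + 1], where entry d
    compares the left feature at column x with the right feature at
    column x - d on the same row.
    """
    height, width = len(left), len(left[0])
    channels = len(left[0][0])
    volume = [[[0.0] * (max_disp + 1) for _ in range(width)]
              for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for d in range(max_disp + 1):
                xr = x - d
                if xr < 0:
                    continue  # no right-image pixel at this disparity; leave 0.0
                dot = sum(a * b for a, b in zip(left[y][x], right[y][xr]))
                volume[y][x][d] = dot / channels  # assumed normalization
    return volume


# Tiny usage example: one row, three columns, two channels.
left_feat = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
right_feat = [[[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]]
vol = correlation_volume(left_feat, right_feat, max_disp=1)
print(vol[0][2])  # correlation scores at column 2 for disparities 0 and 1
```

In a real pipeline this would be computed with batched tensor operations on learned features; the explicit loops here just make the indexing visible.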