Towards a Multi-Embodied Grasping Agent

Roman Freiberg, Alexander Qualmann, Ngo Anh Vien, Gerhard Neumann

TL;DR

The paper introduces a data-efficient, equivariant flow-based framework for multi-embodiment grasping that generalizes across grippers with varying DoFs. It leverages SE(3)-equivariant representations and a JAX-based, batched architecture to synthesize pre-grasp poses directly from full-scene point clouds, avoiding reliance on pose estimation. Key contributions include per-joint equivariant gripper embeddings, a multiscale equivariant scene encoder, and a flow-decoding pipeline trained with flow-matching, validated on a large, multi-gripper dataset with both single- and multi-embodiment settings. The work demonstrates competitive performance with state-of-the-art methods while enabling scalable training/inference and releasing open-source code and data to facilitate future research.

Abstract

Multi-embodiment grasping focuses on developing approaches that exhibit generalist behavior across diverse gripper designs. Existing methods often learn the kinematic structure of the robot implicitly and face challenges due to the difficulty of sourcing the required large-scale data. In this work, we present a data-efficient, flow-based, equivariant grasp synthesis architecture that can handle different gripper types with variable degrees of freedom and successfully exploit the underlying kinematic model, deducing all necessary information solely from the gripper and scene geometry. Unlike previous equivariant grasping methods, we reimplemented all modules from the ground up in JAX and provide a model with batching capabilities over scenes, grippers, and grasps, resulting in smoother learning, improved performance, and faster inference. Our dataset encompasses grippers ranging from humanoid hands to parallel-jaw grippers and includes 25,000 scenes and 20 million grasps.

Paper Structure

This paper contains 17 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Equivariant Gripper Embeddings. An initial gripper configuration (a) is represented by a learned feature embedding $z$. After a physical joint rotation $\Delta R$, the gripper is in a new configuration (b). Our method ensures the features are correspondingly transformed via the Wigner-D matrices, $z' = \mathbf{D}(\Delta R)z$, keeping the representation consistent with the physical state.
  • Figure 2: Method Overview. (Left) Grippers are represented with per-joint equivariant embeddings. (a) Full Pipeline. A scene point cloud is encoded into a multi-scale equivariant feature pyramid. Time-conditioned joint features query this pyramid to extract pose and joint information. These scene-aware queries are then decoded to predict flow gradients, which generate the final pre-grasp configuration. (b) Kinematics Encoder. Joint values and kinematics are used to compute per-joint transformations, which are applied to the embeddings via Wigner-D matrices. Parent-child features interact through a dot product, conditioning the queries with per-$\ell$-type weights. (c) Multiscale Tensor Field. Hierarchical scene features are time-conditioned using an equivariant FiLM layer [freiberg2025diffusion] and projected to a lower dimension. Relative scene-query positions are encoded via a tensor product dependent on direction and length. The resulting aggregated features for each query are fused with the original joint query via a fully connected tensor product.
  • Figure 3: Multi-Embodiment Grasp Synthesis Examples. Renderings of three sampled pre-grasp configurations for five distinct grippers in cluttered scenes. Included grippers (a) ViperX 300s parallel gripper, (b) Franka Emika parallel gripper, (c) DEX-EE dexterous hand, (d) Allegro Hand, and (e) Shadow Hand.
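
The transformation rule from Figure 1, $z' = \mathbf{D}(\Delta R)z$, and the parent-child dot product from Figure 2(b) can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a single $\ell = 1$ (vector-type) feature per joint, for which the Wigner-D matrix in a Cartesian basis is the ordinary 3×3 rotation matrix (libraries may use a different basis ordering). All names (`rotation_z`, `rotation_x`, `norm_gate`, `z_child`, `z_parent`) are hypothetical and chosen only for illustration.

```python
import math


def rotation_z(angle):
    """Rotation about the z-axis. For l=1 features in a Cartesian basis,
    this matrix also plays the role of the Wigner-D matrix D(R)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_x(angle):
    """Rotation about the x-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(d, z):
    """Apply a 3x3 matrix d to a 3-vector z."""
    return [sum(d[i][j] * z[j] for j in range(3)) for i in range(3)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm_gate(z):
    """Toy equivariant nonlinearity (an assumption, not from the paper):
    rescale z by a function of its rotation-invariant norm."""
    n = math.sqrt(dot(z, z))
    return [math.tanh(n) * x for x in z]


# Hypothetical joint rotation Delta R and two hypothetical l=1 features.
delta_R = matmul(rotation_z(0.7), rotation_x(-0.3))
z_child = [0.2, -1.0, 0.5]
z_parent = [1.0, 0.3, -0.4]

# z' = D(Delta R) z for both features.
z_child_rot = apply(delta_R, z_child)
z_parent_rot = apply(delta_R, z_parent)

# Equivariance: transforming then gating equals gating then transforming.
lhs = norm_gate(z_child_rot)
rhs = apply(delta_R, norm_gate(z_child))
equivariance_error = max(abs(a - b) for a, b in zip(lhs, rhs))

# Invariance: a dot product of two features rotated together is unchanged.
invariance_error = abs(dot(z_child_rot, z_parent_rot) - dot(z_child, z_parent))

print(equivariance_error, invariance_error)
```

Both printed errors should be at the level of floating-point round-off. Higher-order features ($\ell > 1$) follow the same pattern with larger Wigner-D matrices, which this sketch does not construct.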