Grasp2Grasp: Vision-Based Dexterous Grasp Translation via Schrödinger Bridges

Tao Zhong, Jonah Buchanan, Christine Allen-Blanchette

TL;DR

This work transfers grasp intent across dexterous hands with different morphologies from visual observations, without paired demonstrations. It frames grasp translation as a Schrödinger Bridge transport between source and target grasp distributions conditioned on object observations, trained with latent SF$^2$M (score and flow matching) under an entropic OT plan $\pi_{\varepsilon}^*$. The pipeline has two stages: a VAE encodes the source observation into a latent $z$, and a latent Schrödinger Bridge translates $z$ to $z'$, which a decoder maps to the target grasp. Four physics-informed OT ground costs guide the translation: $d_{\mathrm{pose}}$, $d_{\mathrm{contact}}$, $d_{\mathrm{wrench}}$, and $d_{\mathrm{jac}}$. Experiments on the MultiGripperGrasp dataset show improved grasp success and functional alignment across hand–object pairs, enabling semantically meaningful grasp transfer without hand-specific simulation. The results highlight the potential of distributional transport for generalizable manipulation across heterogeneous hardware.
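To make the entropic OT plan $\pi_{\varepsilon}^*$ concrete, here is a minimal NumPy sketch of Sinkhorn iterations between two small sets of latent codes. It is an illustration under stated assumptions, not the authors' implementation: the function name `sinkhorn_plan`, the uniform marginals, the value of `eps`, and the squared Euclidean ground cost are all hypothetical stand-ins. The paper's physics-informed costs ($d_{\mathrm{pose}}$, $d_{\mathrm{contact}}$, $d_{\mathrm{wrench}}$, $d_{\mathrm{jac}}$) would take the place of `cost`.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.2, n_iters=1000):
    """Approximate the entropic OT plan between two uniform empirical measures.

    cost: (n, m) ground-cost matrix (assumed; any non-negative cost works).
    eps:  entropic regularisation strength (assumed value).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform source marginal (assumption)
    b = np.full(m, 1.0 / m)          # uniform target marginal (assumption)
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows to match a
        v = b / (K.T @ u)            # rescale columns to match b
    return u[:, None] * K * v[None, :]

# Toy latent codes standing in for source and target grasp latents.
rng = np.random.default_rng(0)
z_src = rng.normal(size=(8, 4))
z_tgt = rng.normal(size=(8, 4)) + 1.0

# Squared Euclidean cost as a placeholder for a physics-informed cost.
cost = ((z_src[:, None, :] - z_tgt[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()             # normalise to keep exp(-cost/eps) well scaled

plan = sinkhorn_plan(cost)
print(plan.sum())                    # total mass of the coupling
```

Each entry `plan[i, j]` is the mass coupling source sample `i` with target sample `j`. In a minibatch setting, training pairs could be drawn from this coupling instead of being paired at random.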

Abstract

We propose a new approach to vision-based dexterous grasp translation, which aims to transfer grasp intent across robotic hands with differing morphologies. Given a visual observation of a source hand grasping an object, our goal is to synthesize a functionally equivalent grasp for a target hand without requiring paired demonstrations or hand-specific simulations. We frame this problem as a stochastic transport between grasp distributions using the Schrödinger Bridge formalism. Our method learns to map between source and target latent grasp spaces via score and flow matching, conditioned on visual observations. To guide this translation, we introduce physics-informed cost functions that encode alignment in base pose, contact maps, wrench space, and manipulability. Experiments across diverse hand-object pairs demonstrate our approach generates stable, physically grounded grasps with strong generalization. This work enables semantic grasp transfer for heterogeneous manipulators and bridges vision-based grasping with probabilistic generative modeling. Additional details at https://grasp2grasp.github.io/

Paper Structure

This paper contains 31 sections, 21 equations, 6 figures, 13 tables, 3 algorithms.

Figures (6)

  • Figure 1: (top) Comparison of morphology and scale across hands. (bottom) Object-agnostic methods often miss fine-grained contacts, leading to invalid grasps. Our method produces contact-consistent, stable grasps across diverse hand morphologies.
  • Figure 2: Illustration of the Schrödinger Bridge Models. The Schrödinger Bridge process transports samples from a source distribution ($q_{0}$) to a target ($q_{1}$). At an intermediate point $x_t$ on a Brownian bridge trajectory between coupled samples $(x_0, x_1)$, our model learns the conditional flow drift $u_t^\circ$ that drives transport and the conditional score $\nabla \log p_t$ that corrects the path.
  • Figure 3: Architecture overview. Blue modules correspond to stage 1: the source hand observation is encoded via a VAE. Orange modules correspond to stage 2: the latent is translated using a U-ViT Schrödinger Bridge model conditioned on object shape and contact anchors. The translated latent is decoded to produce the target hand grasp.
  • Figure 4: (left) Qualitative results. The 'Pose', 'Contact', 'GWH', and 'Jacobian' columns show results from our method when trained with the respective physics-informed OT cost functions. Our method generates stable and consistent grasps across the OT variants, even when the source grasp is sub-optimal, a case where RFP (Khargonkar et al., 2024) fails. (right) Failure modes: our method struggles with thin-shell objects because of their challenging geometry.
  • Figure 5: More qualitative examples. Each column shows a source grasp and its translated output for different methods. Our method consistently recovers physically plausible grasps across tasks, while RFP frequently fails, especially with noisy source grasps.
  • ...and 1 more figure
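The quantities named in the Figure 2 caption can be sketched in a few lines of NumPy. This is a hedged illustration, not the paper's code. It assumes the standard Brownian-bridge marginal $x_t \sim \mathcal{N}\big((1-t)x_0 + t x_1,\ \sigma^2 t(1-t) I\big)$ used in score-and-flow-matching work, for which the conditional flow and conditional score have closed forms. The function name `brownian_bridge_targets` and the value of `sigma` are hypothetical, and the authors' exact parameterisation may differ.

```python
import numpy as np

def brownian_bridge_targets(x0, x1, t, sigma, rng):
    """Sample x_t on a Brownian bridge and return closed-form regression targets.

    Assumes p_t(x | x0, x1) = N((1 - t) x0 + t x1, sigma^2 t (1 - t) I), 0 < t < 1.
    """
    mean = (1.0 - t) * x0 + t * x1
    var = sigma**2 * t * (1.0 - t)
    xt = mean + np.sqrt(var) * rng.standard_normal(x0.shape)
    # Conditional flow: velocity field that transports the Gaussian marginal.
    flow = (1.0 - 2.0 * t) / (2.0 * t * (1.0 - t)) * (xt - mean) + (x1 - x0)
    # Conditional score: gradient of log N(x; mean, var I) evaluated at xt.
    score = (mean - xt) / var
    return xt, flow, score

rng = np.random.default_rng(0)
x0 = np.zeros(4)                     # stand-in for a source latent
x1 = np.ones(4)                      # stand-in for a coupled target latent
xt, flow, score = brownian_bridge_targets(x0, x1, t=0.5, sigma=1.0, rng=rng)
print(flow)                          # at t = 0.5 the flow reduces to x1 - x0
```

In a training loop, a network would regress onto `flow` and `score` at randomly drawn times `t` and coupled pairs `(x0, x1)`. Conditioning on the visual observation is omitted here.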