PhysGraph: Physically-Grounded Graph-Transformer Policies for Bimanual Dexterous Hand-Tool-Object Manipulation

Runfa Blark Li, David Kim, Xinshuang Liu, Keito Suzuki, Dwait Bhatt, Nikola Raicevic, Xin Lin, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen

TL;DR

This work proposes a physically-grounded bias generator that injects structural priors directly into the attention mechanism, including kinematic spatial distance, dynamic contact states, geometric proximity, and anatomical properties, which allows the policy to explicitly reason about physical interactions rather than learning them implicitly from sparse rewards.

Abstract

Bimanual dexterous manipulation for tool use remains a formidable challenge in robotics due to the high-dimensional state space and complicated contact dynamics. Existing methods naively represent the entire system state as a single configuration vector, disregarding the rich structural and topological information inherent to articulated hands. We present PhysGraph, a physically-grounded graph transformer policy designed explicitly for challenging bimanual hand-tool-object manipulation. Unlike prior work, we represent the bimanual system as a kinematic graph and introduce per-link tokenization to preserve fine-grained local state information. We propose a physically-grounded bias generator that injects structural priors directly into the attention mechanism, including kinematic spatial distance, dynamic contact states, geometric proximity, and anatomical properties. This allows the policy to explicitly reason about physical interactions rather than learning them implicitly from sparse rewards. Extensive experiments show that PhysGraph significantly outperforms the ManipTrans baseline in manipulation precision and task success rate while using only 51% of its parameters. Furthermore, the inherent topological flexibility of our architecture enables qualitative zero-shot transfer to unseen tool/object geometries, and the architecture is sufficiently general to be trained on three robotic hands (Shadow, Allegro, Inspire).

Paper Structure (12 sections, 13 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: PhysGraph is a physically-grounded graph transformer policy for bimanual hand-tool-object manipulation. Top row: PhysGraph policy trained and tested on diverse bimanual tool-use tasks with different robot hands (Allegro, ArtiMano, Shadow). Bottom row: Zero-shot policy transfer to unseen tool and object for similar tasks, e.g., trained on "slicing bread with chop knife" and tested on "cutting apple with fruit knife". Our code and additional results are available at: https://blarklee.github.io/PhysGraph_website_official/
  • Figure 2: Overview of PhysGraph: (a) Physical Graph & Tokenization: The bimanual workspace is modeled as a graph where nodes represent links of the left/right hands, tools, and objects, and edges represent kinematic (joint) or dynamic (contact) interactions. State-based multi-modal observations for each link are processed into parallel input tokens. (b) Physically-Grounded Bias Generator: This module computes four distinct biases (detailed in Fig. 3), which are aggregated into a composite bias matrix $\mathcal{M}$. The biases are applied via head-specific masking, allowing different attention heads to focus on specific physical relationships. (c) Graph Transformer Encoder: The tokenized inputs are processed by a transformer encoder whose Multi-Head Attention (MHA) is modulated by the bias generated in (b). (d) Output Heads: The globally encoded [POL] token is passed to MLP heads to predict the policy action distribution ($\mu$, $\sigma$) and the value function ($V$).
  • Figure 3: Details of the Physically-Grounded Biases. (a) Spatial Bias: Encodes the topological structure of the hand graph by mapping the shortest-path distance $d(u, v)$ between nodes to learned scalar embeddings $b_{sp}^{(h)}$, capturing the "hop" distance in the kinematic chain. (b) Edge Bias: Injects structural information by assigning distinct learned embeddings $b_{edge}^{(h)}$ indexed by the edge type $\tau$ connecting two nodes. (c) Geometric Bias: Incorporates spatial proximity in Cartesian space. It uses a Radial Basis Function (RBF) kernel $\kappa(u, v)$ to weight attention based on the squared Euclidean distance $\|\mathbf{p}_u - \mathbf{p}_v\|^2$ between node positions. (d) Anatomical Priors: Encodes knowledge of anatomical hand kinematics. The Serial Mask ($M_{ser}$) highlights dependencies along the kinematic chain of a single finger, while the Synergy Mask ($M_{syn}$) promotes attention between corresponding links on different fingers at the same anatomical level.
  • Figure 4: Qualitative results on bimanual tool-use tasks from the OakInk2 dataset. Left to right: Ground truth, PhysGraph (ours), ManipTrans. Top to bottom: 0837f@0, e1fa6@0, 817fb@0, 1292e@0. Please refer to the corresponding videos on our website.
  • Figure 5: Zero-Shot Policy Transfer without finetuning. Left to right: Ground truth, PhysGraph (ours), ManipTrans. Top to bottom (sequence trained $\rightarrow$ sequence deployed): 0837f@0 $\rightarrow$ 9fc3e@0 (chop knife $\rightarrow$ fruit knife; bread $\rightarrow$ apple), e1fa6@0 $\rightarrow$ 66c7f@0, 1292e@0 $\rightarrow$ b9695@0.
  • ...and 1 more figures
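
The captions for Figures 2 and 3 describe attention that is modulated by physically-grounded biases. The following is a minimal single-head sketch of that idea, not the authors' implementation. It assumes the biases are simply added to the attention logits before the softmax, and it covers only the spatial (hop-distance) and geometric (RBF) terms; the edge bias and the anatomical masks are omitted. The function names, the `spatial_table` values (standing in for learned embeddings), and `gamma` are all hypothetical.

```python
import math
from collections import deque


def hop_distances(num_nodes, edges):
    """All-pairs shortest-path hop counts d(u, v) via BFS; -1 if unreachable."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for s in range(num_nodes):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s][v] < 0:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist


def rbf_kernel(p_u, p_v, gamma):
    """RBF kernel on the squared Euclidean distance between two positions."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(p_u, p_v))
    return math.exp(-gamma * sq_dist)


def composite_bias(dist, positions, spatial_table, gamma):
    """Sum of a hop-indexed scalar (assumed 'learned') and the RBF term."""
    n = len(dist)
    last = len(spatial_table) - 1
    bias = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            d = dist[u][v]
            # Assumption: far or unreachable pairs share the last bucket.
            bucket = last if d < 0 else min(d, last)
            bias[u][v] = spatial_table[bucket] + rbf_kernel(
                positions[u], positions[v], gamma
            )
    return bias


def biased_attention_weights(scores, bias):
    """Row-wise softmax of (scores + bias), i.e. additively biased attention."""
    weights = []
    for s_row, b_row in zip(scores, bias):
        logits = [s + b for s, b in zip(s_row, b_row)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights


# Toy example: one finger as a 4-link chain, links spaced 3 cm apart on x.
edges = [(0, 1), (1, 2), (2, 3)]
positions = [(0.00, 0.0, 0.0), (0.03, 0.0, 0.0), (0.06, 0.0, 0.0), (0.09, 0.0, 0.0)]
spatial_table = [0.5, 0.2, 0.0, -0.2]  # hypothetical stand-ins for learned scalars
dist = hop_distances(4, edges)
bias = composite_bias(dist, positions, spatial_table, gamma=100.0)
scores = [[0.0] * 4 for _ in range(4)]  # uniform content scores isolate the bias
weights = biased_attention_weights(scores, bias)
```

With uniform content scores, each link attends more strongly to links that are close in both hop distance and Cartesian distance, which is the qualitative effect the spatial and geometric biases are described as providing.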