Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects

Tai Hoang, Huy Le, Philipp Becker, Vien Anh Ngo, Gerhard Neumann

TL;DR

This work addresses the challenge of manipulating objects with varying geometries and deformable materials by framing robotic tasks as heterogeneous graphs with distinct actuator and object nodes. It introduces the Heterogeneous Equivariant Policy (HEPi), an SE(3) equivariant graph-based policy that explicitly models heterogeneity and uses an efficient equivariant MPNN backbone (PONITA-based) to enable robust 3D manipulation. A principled trust-region training approach (TRPL) is employed to stabilize on-policy learning, and a new seven-task benchmark in NVIDIA IsaacLab demonstrates improved performance, sample efficiency, and generalization over Transformer and non-heterogeneous baselines, especially in complex 3D scenarios like Cloth-Hanging and multi-agent insertions. The work advances geometric RL in robotics by combining explicit heterogeneity with SE(3) symmetry, yielding practical impact for dexterous manipulation of rigid and deformable objects in 3D spaces, while outlining avenues for incorporating full robot morphology and vision-based perception.

Abstract

Manipulating objects with varying geometries and deformable objects is a major challenge in robotics. Tasks such as insertion with different objects or cloth hanging require precise control and effective modelling of complex dynamics. In this work, we frame this problem through the lens of a heterogeneous graph that comprises smaller sub-graphs, such as actuators and objects, accompanied by different edge types describing their interactions. This graph representation serves as a unified structure for both rigid and deformable object tasks, and can be extended further to tasks comprising multiple actuators. To evaluate this setup, we present a novel and challenging reinforcement learning benchmark, including rigid insertion of diverse objects, as well as rope and cloth manipulation with multiple end-effectors. These tasks present a large search space, as both the initial and target configurations are uniformly sampled in 3D space. To address this issue, we propose a novel graph-based policy model, dubbed Heterogeneous Equivariant Policy (HEPi), utilizing $SE(3)$ equivariant message passing networks as the main backbone to exploit the geometric symmetry. In addition, by modeling explicit heterogeneity, HEPi can outperform Transformer-based and non-heterogeneous equivariant policies in terms of average returns, sample efficiency, and generalization to unseen objects. Our project page is available at https://thobotics.github.io/hepi.
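
To make the notion of exploiting geometric symmetry concrete, the following is a minimal sketch of what equivariance means for a policy that outputs a 3D action vector. It is not the HEPi architecture: `toy_equivariant_policy` and `random_rotation` are hypothetical helpers written only for illustration. The toy policy weights relative vectors by rotation-invariant distances, so rotating and translating the whole scene should rotate the resulting action by the same rotation.

```python
import numpy as np


def toy_equivariant_policy(actuator_pos, object_pos):
    """Toy stand-in for an equivariant policy (illustrative assumption, not HEPi).

    actuator_pos: (3,) position of a single actuator.
    object_pos:   (N, 3) positions of the object nodes.
    Returns a (3,) action vector.
    """
    rel = object_pos - actuator_pos        # relative vectors: unaffected by translation
    dist = np.linalg.norm(rel, axis=1)     # distances: unaffected by rotation
    weights = np.exp(-dist)                # invariant scalar weights
    weights = weights / weights.sum()
    return weights @ rel                   # weighted sum of vectors rotates with the scene


def random_rotation(rng):
    """Sample a proper 3D rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))            # make the decomposition sign-consistent
    if np.linalg.det(q) < 0:               # flip one axis if we got a reflection
        q[:, 0] = -q[:, 0]
    return q


rng = np.random.default_rng(0)
R = random_rotation(rng)
t = rng.normal(size=3)
actuator = rng.normal(size=3)
objects = rng.normal(size=(5, 3))

action = toy_equivariant_policy(actuator, objects)
# Apply the same rigid transform (rotation R, translation t) to every input point.
action_transformed = toy_equivariant_policy(actuator @ R.T + t, objects @ R.T + t)

# Equivariance: transforming the scene rotates the action by R.
print(np.allclose(action_transformed, R @ action))
```

A policy with this property does not need to relearn the same behavior for every orientation of the scene, which matters when initial and target configurations are sampled across 3D space.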

Paper Structure

This paper contains 65 sections, 1 theorem, 19 equations, 22 figures, 6 tables.

Key Result

Proposition 3.1

For $\text{MPNN}$ + $\text{VN}_{\text{Local}}$, the Jacobian $\partial \mathbf{f}_v^{\text{act}}/\partial \mathbf{f}_u^{\text{obj}}$ is independent of $u$ whenever object node $u$ and actuator node $v$ are separated by more than 2 hops. In contrast, HEPi with node connections and updates as describe

Figures (22)

  • Figure 1: Left: A Cloth-Hanging task represented by a heterogeneous graph that comprises two disjoint node sets, objects and actuators, connected through directed, fully-connected inter-edges. Intra-edges occur within each set (both objects and actuators) to capture relationships within clusters. Information is aggregated from objects to actuators via inter-edges. The target distance is absorbed into the feature representation rather than treated as a separate node type. Right: Overview of the Heterogeneous Equivariant Policy (HEPi). Multiple Equivariant Message Passing Networks (EMPNs) process the graph, and their outputs are aggregated to generate the final action.
  • Figure 2: Illustration of our diverse and challenging manipulation tasks, involving both rigid and deformable objects. These tasks require precise control under complex geometric constraints, coordination between multiple actuators, and handling of intricate interactions between objects and actuators. The variety of tasks highlights the need for policies that can understand the geometric structure in large observation and action spaces.
  • Figure 3: Evaluation curves for our seven manipulation tasks, comparing HEPi (ours), EMPN, and Transformer baselines. Results are averaged over 10 seeds, using IQM with 95% confidence intervals. HEPi consistently outperforms EMPN and Transformer in tasks requiring complex exploration and heterogeneity handling, such as rigid-insertion-two-agents-3D, rigid-pushing-2D and cloth-hanging-3D.
  • Figure 4: Performance of different models on the Cloth-Hanging task across varying sample spaces. Overall, performance improves as the sample space decreases. In terms of final performance, heterogeneous models outperform homogeneous baselines in most cases, demonstrating the benefits of explicit heterogeneity modeling. Additionally, applying equivariant constraints is critical for achieving superior performance in 3D tasks. More results can be found in Appendix \ref{appx:further_exp}.
  • Figure 5: Left: Analysis of noise sensitivity and scalability to high-resolution objects in the Rigid-Pushing task. Heatmaps show average returns under varying levels of artificial Gaussian noise in position and velocity inputs for both low-resolution and high-resolution objects. A single HEPi agent, trained on a low-resolution object with additive Gaussian noise ($\sigma = 0.01$), was used for all evaluations. Right: Generalization performance on Rigid-Sliding and Rigid-Insertion tasks. Models are trained on one object (plus), two objects (plus, star), and three objects (plus, star, pentagon) and tested on the remaining unseen objects. Overall, HEPi generalizes well to unseen objects, performs consistently across resolutions, and handles noise effectively, making it suitable for real-world tasks.
  • ...and 17 more figures
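
The graph layout described in the Figure 1 caption can be sketched as plain edge-index arrays. This is only an illustration under stated assumptions: the function name `build_hetero_graph`, the dictionary keys, and the `radius` parameter are hypothetical, and the caption does not say how intra-edges are chosen, so a radius graph among object nodes and full connectivity among actuators are assumed here. The directed, fully-connected object-to-actuator inter-edges follow the caption.

```python
import numpy as np


def build_hetero_graph(object_pos, actuator_pos, radius=0.5):
    """Build edge indices for a two-node-type graph (illustrative sketch).

    object_pos:   (n_obj, 3) object node positions.
    actuator_pos: (n_act, 3) actuator node positions.
    Each edge array has shape (2, n_edges) with rows (source, destination).
    """
    n_obj, n_act = len(object_pos), len(actuator_pos)

    # Object intra-edges: connect object nodes closer than `radius` (assumption).
    dists = np.linalg.norm(object_pos[:, None] - object_pos[None, :], axis=-1)
    src, dst = np.nonzero((dists < radius) & ~np.eye(n_obj, dtype=bool))
    obj_to_obj = np.stack([src, dst])

    # Actuator intra-edges: fully connected without self-loops (assumption).
    a_src, a_dst = np.nonzero(~np.eye(n_act, dtype=bool))
    act_to_act = np.stack([a_src, a_dst])

    # Inter-edges: every object node sends to every actuator node (directed).
    o_idx, a_idx = np.meshgrid(np.arange(n_obj), np.arange(n_act), indexing="ij")
    obj_to_act = np.stack([o_idx.ravel(), a_idx.ravel()])

    return {
        ("object", "object"): obj_to_obj,
        ("actuator", "actuator"): act_to_act,
        ("object", "actuator"): obj_to_act,
    }


# Four object nodes on a line, two actuators above them.
objects = np.array([[0.0, 0, 0], [0.3, 0, 0], [0.6, 0, 0], [0.9, 0, 0]])
actuators = np.array([[0.0, 0, 1.0], [0.9, 0, 1.0]])
graph = build_hetero_graph(objects, actuators)
for edge_type, edges in graph.items():
    print(edge_type, edges.shape)
```

Keeping one edge array per edge type lets each type be processed by its own message passing network, which is the separation the caption attributes to the multiple EMPNs.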

Theorems & Definitions (3)

  • Definition 2.1
  • Proposition 3.1
  • proof