Latent Space Planning for Multi-Object Manipulation with Environment-Aware Relational Classifiers

Yixuan Huang, Nichols Crawford Taylor, Adam Conkey, Weiyu Liu, Tucker Hermans

TL;DR

This work introduces environment-aware relational dynamics for multi-object manipulation from partial-view point clouds, proposing two architectures—eRDTransformer and RD-GNN—that learn latent-space dynamics to predict inter-object and object-environment relations. The models support sequential planning to satisfy logical relation goals without requiring explicit 3D object models, and they demonstrate reliable sim-to-real transfer without fine-tuning. A graph-search planning method with subgoal enumeration and CEM enables efficient multi-step plans, with extensive simulation and real-world experiments showing superior performance of the transformer-based latent dynamics. Key findings include the advantage of relational supervision over pose supervision and the superior generalization of transformer-based dynamics to larger, more diverse datasets and complex environments.

Abstract

Objects rarely sit in isolation in everyday human environments. If we want robots to operate and perform tasks in our human environments, they must understand how the objects they manipulate will interact with structural elements of the environment for all but the simplest of tasks. As such, we'd like our robots to reason about how multiple objects and environmental elements relate to one another and how those relations may change as the robot interacts with the world. We examine the problem of predicting inter-object and object-environment relations between previously unseen objects and novel environments purely from partial-view point clouds. Our approach enables robots to plan and execute sequences to complete multi-object manipulation tasks defined from logical relations. This removes the burden of providing explicit, continuous object states as goals to the robot. We explore several different neural network architectures for this task. We find the best performing model to be a novel transformer-based neural network that both predicts object-environment relations and learns a latent-space dynamics function. We achieve reliable sim-to-real transfer without any fine-tuning. Our experiments show that our model understands how changes in observed environmental geometry relate to semantic relations between objects. We show more videos on our website: https://sites.google.com/view/erelationaldynamics.
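
As an illustration of what a goal "defined from logical relations" could look like in practice, here is a minimal sketch, not the authors' implementation. The relation names (`above`, `contact`), the object names, the probability dictionary, and the 0.5 threshold are all hypothetical choices made for this example. A goal is a set of desired truth values over relations, and it is checked against the relation probabilities a classifier might output.

```python
from typing import Dict, Tuple

# A relation is (relation_name, subject_segment, object_segment).
# All names used here are hypothetical, chosen only for illustration.
Relation = Tuple[str, str, str]


def goal_satisfied(
    goal: Dict[Relation, bool],
    predicted: Dict[Relation, float],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> bool:
    """Return True if every goal relation has the desired truth value.

    `predicted` maps relations to classifier probabilities in [0, 1].
    A goal relation with no prediction counts as unsatisfied.
    """
    for relation, desired in goal.items():
        probability = predicted.get(relation)
        if probability is None:
            return False
        if (probability >= threshold) != desired:
            return False
    return True


# Goal: the mug is above the shelf and no longer touching the table.
goal = {
    ("above", "mug", "shelf"): True,
    ("contact", "mug", "table"): False,
}

# Hypothetical classifier outputs for one predicted state.
predicted = {
    ("above", "mug", "shelf"): 0.92,
    ("contact", "mug", "table"): 0.08,
    ("left", "mug", "box"): 0.40,  # unconstrained by the goal, so ignored
}

print(goal_satisfied(goal, predicted))  # True
```

Note that the goal leaves the `left` relation unconstrained and never mentions a continuous pose: any state whose predicted relations match the listed truth values counts as a success.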

Paper Structure (29 sections, 1 equation, 14 figures, 2 tables)

Figures (14)

  • Figure 2: Taking a segmented, partial-view point cloud as input, we first process it using PointConv to generate segment-specific features. We then pass these features into an encoder to predict a latent state $\mathbf{X}$. We can decode $\mathbf{X}$ to predict both whether each segment is a movable object and the relations between objects and environment segments. By learning an action-conditioned latent-space dynamics model, our approach can be used to solve multi-step planning problems. In green we highlight predicted relations that satisfy relations in the logical goal $\mathbf{g}$.
  • Figure 3: For the same initial scene (left) we show different valid states found by our planner and model for two different goal settings. For the first goal relation, the robot can either pick the green object or the red object to place atop the yellow object. For the second goal relation, the robot can either push the green object or pick-and-place the green object to deconstruct the towers.
  • Figure 4: Visualization of our logical subgoal graph search. The root node of the tree contains the initial state encoded from the observed scene as well as an empty subgoal and null action. The search prioritizes longer subgoals first to induce shorter plans. The green shaded nodes represent satisfied subgoals. If a satisfied subgoal matches the given goal the search ends.
  • Figure 5: Comparing planning success rate of the different models as a function of (left) the number of objects in the scene, (middle) the number of relations specified in the goal, and (right) the number of steps. The legend applies to all three plots. We see that eRDTransformer, RD-GNN, and RD-PE-GNN achieve comparable performance while significantly outperforming the baseline models. The success rate drops for all models as we specify more relations in the goal. Even when the goal is fully constrained, the top-performing models achieve high success rates.
  • Figure 6: Number of successes on real world YCB object manipulation tasks. We compare results for a varying number of objects and varying plan horizon length as denoted by the horizontal labels. We ran 5 trials for each task per model.
  • ...and 9 more figures
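
The Figure 4 caption says the subgoal search "prioritizes longer subgoals first to induce shorter plans". A minimal sketch of that ordering, assuming a goal is a dictionary of relations with desired truth values, is shown below. The function name `enumerate_subgoals` and the relation and object names are hypothetical. The sketch covers only the enumeration order, not how the search expands nodes or scores them with the learned latent dynamics.

```python
from itertools import combinations
from typing import Dict, Iterator, Tuple

# (relation_name, subject_segment, object_segment); names are hypothetical.
Relation = Tuple[str, str, str]
Goal = Dict[Relation, bool]


def enumerate_subgoals(goal: Goal) -> Iterator[Goal]:
    """Yield every non-empty subset of the goal, largest subsets first.

    Trying large subgoals first means a single action that satisfies many
    goal relations at once is considered before smaller steps.
    """
    items = sorted(goal.items())  # fixed order for reproducible output
    for size in range(len(items), 0, -1):
        for subset in combinations(items, size):
            yield dict(subset)


goal = {
    ("above", "red", "yellow"): True,
    ("contact", "red", "yellow"): True,
    ("above", "green", "red"): False,
}

subgoals = list(enumerate_subgoals(goal))
print(len(subgoals))         # 7 non-empty subsets of 3 relations
print(subgoals[0] == goal)   # True: the full goal is tried first
```

With n relations in the goal this enumerates 2^n - 1 subgoals, so the approach is only practical for goals with a modest number of relations.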