Points2Plans: From Point Clouds to Long-Horizon Plans with Composable Relational Dynamics

Yixuan Huang, Christopher Agia, Jimmy Wu, Tucker Hermans, Jeannette Bohg

TL;DR

Points2Plans tackles long-horizon robotic manipulation from partial-view point clouds by unifying symbolic and geometric reasoning through a transformer-based relational dynamics model trained on single-step transitions. A hybrid latent-geometric rollout paired with an LLM-guided task planner enables efficient planning of manipulation sequences, while a sampling-based parameter planner enforces feasibility against predicate constraints. The framework demonstrates strong generalization to unseen tasks and real-world success (>85%) compared with baselines (~50%), highlighting the viability of composable planning from rich perceptual input. This approach advances scalable, language-driven planning for complex, occluded environments without requiring multi-step demonstrations for training, paving the way for robust, open-world robotics applications.

Abstract

We present Points2Plans, a framework for composable planning with a relational dynamics model that enables robots to solve long-horizon manipulation tasks from partial-view point clouds. Given a language instruction and a point cloud of the scene, our framework initiates a hierarchical planning procedure, whereby a language model generates a high-level plan and a sampling-based planner produces constraint-satisfying continuous parameters for manipulation primitives sequenced according to the high-level plan. Key to our approach is the use of a relational dynamics model as a unifying interface between the continuous and symbolic representations of states and actions, thus facilitating language-driven planning from high-dimensional perceptual input such as point clouds. Whereas previous relational dynamics models require training on datasets of multi-step manipulation scenarios that align with the intended test scenarios, Points2Plans uses only single-step simulated training data while generalizing zero-shot to a variable number of steps during real-world evaluations. We evaluate our approach on tasks involving geometric reasoning, multi-object interactions, and occluded object reasoning in both simulated and real-world settings. Results demonstrate that Points2Plans offers strong generalization to unseen long-horizon tasks in the real world, where it solves over 85% of evaluated tasks while the next best baseline solves only 50%.
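
The parameter-planning step described above can be illustrated with a minimal toy sketch: sample continuous parameters for each primitive in a fixed task plan, compose a single-step dynamics function over the plan, reject rollouts that violate a constraint, and keep the sample with the highest goal score. Everything here is a hypothetical stand-in and not the authors' implementation: the 1D shelf state, the `place` primitive, the `WIDTH` collision check, the `left_of` goal predicates, and all function names are assumptions made for illustration, and the hand-written `step` function replaces the learned relational dynamics model.

```python
import math
import random

WIDTH = 0.1  # assumed object width for the toy collision check


def step(state, primitive, param):
    """Single-step dynamics stand-in: `place` moves one object to position `param`."""
    action, obj = primitive
    assert action == "place"
    nxt = dict(state)
    nxt[obj] = param
    return nxt


def feasible(state):
    """Stand-in constraint: no two objects may be closer than WIDTH."""
    xs = sorted(state.values())
    return all(b - a >= WIDTH for a, b in zip(xs, xs[1:]))


def goal_likelihood(state, goal):
    """Product of soft scores in (0, 1), one per left_of(a, b) goal predicate."""
    score = 1.0
    for a, b in goal:
        margin = state[b] - state[a]
        score *= 1.0 / (1.0 + math.exp(-20.0 * margin))
    return score


def plan_parameters(state, task_plan, goal, num_samples=500, seed=0):
    """Sample parameter sequences, roll out step by step, keep the best feasible one."""
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(num_samples):
        params = [rng.uniform(0.0, 1.0) for _ in task_plan]
        s, ok = state, True
        for primitive, p in zip(task_plan, params):
            s = step(s, primitive, p)  # compose single-step predictions
            if not feasible(s):  # reject plans that hit a constraint
                ok = False
                break
        if ok:
            score = goal_likelihood(s, goal)
            if score > best_score:
                best_params, best_score = params, score
    return best_params, best_score


# Toy usage: the task plan is given (in the paper it comes from a language model).
initial = {"cup": 0.8, "box": 0.2, "can": 0.5}
task_plan = [("place", "cup"), ("place", "box")]
goal = [("cup", "can"), ("can", "box")]  # left_of(cup, can), left_of(can, box)
params, score = plan_parameters(initial, task_plan, goal)
```

Because `step` only ever predicts one transition, the same function serves plans of any length, which is the sense in which single-step dynamics compose over a variable horizon in this sketch.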

Paper Structure (28 sections, 7 equations, 10 figures, 2 tables, 1 algorithm)

Figures (10)

  • Figure 2: Overview of Points2Plans. A partial-view segmented point cloud $\mathbf{o}_1$ is first encoded into the (object-centric) latent state $\mathbf{z}_1$. The latent state $\mathbf{z}_1$ is then decoded into predicates that serve as environment context for the task planning and goal prediction module (e.g., an LLM), from which a task plan $\phi_{1:H}$ and a symbolic goal $\mathcal{G}$ are sampled. Points2Plans then invokes a sampling-based planning procedure to compute continuous parameters $a_{1:H}$ for the manipulation primitives in the task plan $\phi_{1:H}$. Infeasible plans (e.g., collisions) are rejected, and the plan that maximizes the goal likelihood in the final state $\mathbf{z}_{H+1}$ is returned.
  • Figure 3: Points2Plans hybrid rollout strategy.
  • Figure 4: Simulation and real-world results for the Constrained Packing (a-d) and Constrained Retrieval (e-f) tasks. As task complexity increases, Points2Plans significantly outperforms baselines in terms of planning success rate (a-b), position prediction error (c), and predicate classification accuracy (d). Interfacing Points2Plans with an LLM task planner increases planning efficiency (e) and correctness (f). Planning time is shown on a logarithmic scale. Error bars denote standard deviations across 500 trials.
  • Figure 5: Points2Plans generalizes to unseen long-horizon tasks, whereas the baselines struggle to find collision-free plans.
  • Figure 6: A causal Bayes net to derive Eq. \ref{eq:planning-objective}. $\mathcal{G}$ represents the goal predicates, $l$ is the language instruction, $o_1$ is the initial observation, $\phi_{1:H}$ are the task plans, $a_{1:H}$ are the continuous parameters, and $\mathbf{x}_{1:H}$ represent world states (including predicates $\mathbf{r}_{1:H}$ and positions $\mathbf{p}_{1:H}$). Shaded nodes represent observed variables.
  • ...and 5 more figures