PerFACT: Motion Policy with LLM-Powered Dataset Synthesis and Fusion Action-Chunking Transformers

Davood Soleymanzadeh, Xiao Liang, Minghui Zheng

TL;DR

The paper tackles the generalization gap in neural motion planning arising from limited, manually designed datasets. It introduces MotionGeneralizer, an LLM-guided procedural framework that creates diverse, semantically feasible workspaces and planning problems at scale, and MπNetsFusion, a fusion bottleneck transformer that leverages action chunking to plan end-to-end with multiple sensing modalities. By generating a 3.5M-trajectory dataset, training a 4.15M-parameter model, and achieving substantial speedups over baselines, the approach delivers faster, competitive planning while maintaining robustness in held-out and real-world tasks. The work demonstrates practical impact by enabling scalable data-driven motion planning for robotic manipulators in cluttered environments and points to extensions for dynamic settings and broader embodiment applicability.

Abstract

Deep learning methods have significantly enhanced motion planning for robotic manipulators by leveraging prior experiences within planning datasets. However, state-of-the-art neural motion planners are primarily trained on small datasets collected in manually generated workspaces, limiting their generalizability to out-of-distribution scenarios. Additionally, these planners often rely on monolithic network architectures that struggle to encode critical planning information. To address these challenges, we introduce Motion Policy with Dataset Synthesis powered by large language models (LLMs) and Fusion Action-Chunking Transformers (PerFACT), which incorporates two key components. First, a novel LLM-powered workspace generation method, MotionGeneralizer, enables large-scale planning data collection by producing a diverse set of semantically feasible workspaces. Second, we introduce Fusion Motion Policy Networks (MπNetsFusion), a generalist neural motion planner that uses a fusion action-chunking transformer to better encode planning signals and attend to multiple feature modalities. Leveraging MotionGeneralizer, we collect 3.5M trajectories to train MπNetsFusion and evaluate it against state-of-the-art planners; the results show that MπNetsFusion can plan several times faster on the evaluated tasks.
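
The action-chunking idea named in the abstract can be illustrated with a small sketch: instead of predicting one action per network query, the policy returns a short sequence (a chunk) of actions that is executed open-loop before the policy is queried again. The code below is only a toy illustration under stated assumptions, not the authors' implementation. The names `toy_chunk_policy` and `rollout`, the joint-displacement action format, and the step limit `max_step` are all hypothetical, and the hand-written policy merely stands in for a learned transformer.

```python
from typing import Callable, List, Sequence, Tuple

Config = List[float]


def toy_chunk_policy(q: Sequence[float], goal: Sequence[float],
                     chunk_size: int, max_step: float = 0.1) -> List[Config]:
    """Hypothetical stand-in for a learned policy.

    Returns `chunk_size` joint displacements, each clamped to
    [-max_step, max_step] per joint, that move toward `goal`.
    """
    chunk: List[Config] = []
    cur = list(q)
    for _ in range(chunk_size):
        delta = [max(-max_step, min(max_step, g - c)) for c, g in zip(cur, goal)]
        chunk.append(delta)
        cur = [c + d for c, d in zip(cur, delta)]
    return chunk


def rollout(policy: Callable[[Sequence[float], Sequence[float], int], List[Config]],
            start: Sequence[float], goal: Sequence[float],
            chunk_size: int = 4, tol: float = 1e-3,
            max_queries: int = 100) -> Tuple[List[Config], int]:
    """Open-loop rollout: query the policy, execute the whole chunk, repeat."""
    path: List[Config] = [list(start)]
    queries = 0

    def reached(q: Sequence[float]) -> bool:
        return max(abs(g - c) for c, g in zip(q, goal)) <= tol

    while not reached(path[-1]) and queries < max_queries:
        queries += 1  # one (hypothetical) network forward pass
        for delta in policy(path[-1], goal, chunk_size):
            path.append([c + d for c, d in zip(path[-1], delta)])
            if reached(path[-1]):
                break  # stop mid-chunk once the goal is reached
    return path, queries


if __name__ == "__main__":
    path, queries = rollout(toy_chunk_policy, [0.0, 0.0], [1.0, -0.5], chunk_size=4)
    print(len(path) - 1, "steps with", queries, "policy queries")
```

In this toy setting the path needs ten steps, so a chunk size of 4 requires three policy queries where a chunk size of 1 would require ten. This is one intuition for why chunked prediction can reduce planning time; the speedups reported in the paper come from the authors' own model and evaluation, not from this sketch.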

Paper Structure

This paper contains 25 sections, 14 equations, 15 figures, 9 tables, 3 algorithms.

Figures (15)

  • Figure 1: PerFACT's components and real-world deployment. First, MotionGeneralizer (Section \ref{sec: motiongeneralizer}) integrates procedural primitive generation with the reasoning capabilities of a large language model (LLM), GPT-4 \cite{achiam2023gpt}, to produce diverse workspaces for training generalist neural motion planners. Then, MotionGeneralizer generates robot-agnostic, scene-specific motion-planning problems (Section \ref{subsec: planningproblems}) and integrates with a state-of-the-art motion planner, cuRobo \cite{sundaralingam2023curobo}, to generate a large-scale planning dataset. The dataset, combined with the perception modality (Section \ref{subsec: modality}) from MotionGeneralizer, is used to train M$\pi$NetsFusion (Section \ref{sec: motionplanner}). The sensing modality (Section \ref{subsec: modality}) is also used within the open-loop rollout of M$\pi$NetsFusion to solve motion-planning problems in held-out evaluation scenarios (Section \ref{sec:main_eval}). Finally, M$\pi$NetsFusion is deployed in real-world planning scenarios (Section \ref{sec: experiments}). The path from the start configuration to the goal configuration is shown in frames 2-3.
  • Figure 2: MotionGeneralizer framework for diverse workspace generation. The method first randomly selects a robot type and the number of surrounding tables, which are created using procedural generation. Next, it prompts the LLM in a few-shot manner (Fine-tuned Large Language Model) to determine the number of primitives for each table. The suggested primitives are then procedurally generated from the pool of primitives, and the LLM is prompted again in a few-shot manner (Fine-tuned Large Language Model) to specify the location and orientation of each primitive on its respective table, which yields the workspace. The procedure can be repeated to generate an arbitrary number (N) of diverse planning workspaces.
  • Figure 3: Workspace generation: MotionGeneralizer can generate an arbitrary number of cluttered, semantically feasible workspaces given a robotic manipulator type (UR5e) and a pool of everyday primitives.
  • Figure 4: Point Cloud Synthesis: MotionGeneralizer's perception module provides a privileged point-cloud scene representation by uniformly sampling points on the workspace primitives (a) and on the robotic manipulator at arbitrary configurations (b) to increase spatial awareness for downstream planning tasks.
  • Figure 5: Planning Problem Generation: MotionGeneralizer's problem generator module provides robot-agnostic, scene-specific poses for planning problems. Any robotic manipulator, such as the UR5e (top row) or the Franka (bottom row), can then use its collision-aware inverse kinematics to compute the corresponding start and goal configurations.
  • ...and 10 more figures
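
The point-cloud synthesis described in the Figure 4 caption relies on uniformly sampling points on workspace primitives. As a minimal sketch of what that could look like for a single axis-aligned box primitive, the code below picks a face with probability proportional to its area and then samples uniformly within that face. This is an assumption-laden illustration rather than the paper's perception module: the function name `sample_cuboid_surface`, the box parameterization (center plus edge lengths), and the restriction to axis-aligned cuboids are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def sample_cuboid_surface(center: Sequence[float], size: Sequence[float],
                          n: int, rng: random.Random) -> List[Point]:
    """Sample `n` points uniformly on the surface of an axis-aligned box.

    `center` is the box center and `size` its edge lengths (x, y, z).
    Both are hypothetical parameters chosen for this illustration.
    """
    sx, sy, sz = size
    # Area of one face normal to x, y, and z respectively. Each axis has
    # two such faces, so weighting by these areas keeps density uniform.
    areas = [sy * sz, sx * sz, sx * sy]
    points: List[Point] = []
    for _ in range(n):
        axis = rng.choices([0, 1, 2], weights=areas)[0]
        sign = rng.choice([-1.0, 1.0])
        # Uniform offset inside the box, then snap one axis onto a face.
        offset = [rng.uniform(-0.5, 0.5) * s for s in (sx, sy, sz)]
        offset[axis] = sign * 0.5 * size[axis]
        points.append((center[0] + offset[0],
                       center[1] + offset[1],
                       center[2] + offset[2]))
    return points


if __name__ == "__main__":
    rng = random.Random(0)
    cloud = sample_cuboid_surface((0.5, 0.0, 0.25), (0.2, 0.4, 0.1), 1000, rng)
    print(len(cloud), "surface points sampled")
```

Area-weighted face selection matters here: choosing each of the six faces with equal probability would oversample small faces and give a non-uniform point density. A full scene cloud would presumably concatenate such samples over all primitives and the robot's link geometry, but how the paper does this is not specified in the captions above.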