Generative Factor Chaining: Coordinated Manipulation with Diffusion-based Factor Graph

Utkarsh A. Mishra, Yongxin Chen, Danfei Xu

TL;DR

Generative Factor Chaining (GFC) tackles long-horizon manipulation planning for multi-arm systems by representing states and task constraints as a spatial-temporal factor graph and learning short-horizon skill factors as diffusion models. The joint plan is obtained by composing spatial constraint factors with temporal skill factors into a plan-level distribution and sampling via reverse diffusion, enabling parallel and dependent chaining. A modular plug-and-play skill library plus external spatial constraints supports zero-shot generalization to unseen task-object combinations. The method is demonstrated in simulation and on real bimanual Franka Panda hardware, showing robust long-horizon planning and coordination with improved performance on complex multi-arm tasks. Formally, the plan distribution is $p(\tau) \propto \big(\prod_{\pi_k} p_{\pi_k}(S_{\pi_k},a_{\pi_k},S'_{\pi_k})\big)$, illustrating how factor distributions combine to yield feasible plans for sampling via diffusion-based inference.
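
The product form of the plan distribution can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes two hypothetical 1-D Gaussian factors over a single shared variable and swaps in unadjusted Langevin dynamics for the reverse diffusion sampler. It relies only on the fact that the score of an unnormalized product of factors is the sum of the per-factor scores. All function names are illustrative.

```python
import numpy as np


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1-D Gaussian factor."""
    return -(x - mean) / std**2


def composed_score(x, factors):
    """Score of a product of factors: the sum of the per-factor scores."""
    return sum(gaussian_score(x, m, s) for m, s in factors)


def langevin_sample(factors, n_samples=5000, n_steps=500, step=0.01, seed=0):
    """Draw samples from the product distribution with unadjusted Langevin dynamics."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_samples)
    for _ in range(n_steps):
        noise = rng.normal(size=n_samples)
        x = x + step * composed_score(x, factors) + np.sqrt(2 * step) * noise
    return x


def product_gaussian(factors):
    """Closed-form mean and std of a product of 1-D Gaussians (for checking)."""
    precisions = np.array([1.0 / s**2 for _, s in factors])
    means = np.array([m for m, _ in factors])
    var = 1.0 / precisions.sum()
    return var * (precisions * means).sum(), np.sqrt(var)


# Two hypothetical factors as (mean, std) pairs over the same variable.
factors = [(0.0, 1.0), (2.0, 0.5)]
samples = langevin_sample(factors)
expected_mean, expected_std = product_gaussian(factors)
print(samples.mean(), expected_mean)
print(samples.std(), expected_std)
```

The samples concentrate where both factors assign high density, which is the toy analogue of a plan that satisfies every skill factor at once. In GFC the factors are learned diffusion models over states and skill parameters rather than fixed Gaussians.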

Abstract

Learning to plan for multi-step, multi-manipulator tasks is notoriously difficult because of the large search space and the complex constraint satisfaction problems involved. We present Generative Factor Chaining (GFC), a composable generative model for planning. GFC represents a planning problem as a spatial-temporal factor graph, where nodes represent objects and robots in the scene, spatial factors capture the distributions of valid relationships among nodes, and temporal factors represent the distributions of skill transitions. Each factor is implemented as a modular diffusion model, and the factors are composed during inference to generate feasible long-horizon plans through bi-directional message passing. We show that GFC can solve complex bimanual manipulation tasks and exhibits strong generalization to unseen planning tasks with novel combinations of objects and constraints. More details can be found at: https://generative-fc.github.io/
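
To make the graph structure concrete, here is a minimal sketch of how such a spatial-temporal factor graph might be represented. The class names, fields, and the `shared_nodes` helper are hypothetical and chosen for illustration only; they are not taken from the GFC codebase. Nodes are plain strings, spatial factors relate nodes within a state, and temporal factors attach a skill to the nodes it acts on at a given plan step.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpatialFactor:
    """A constraint on the relationship among a few nodes, e.g. Grasped(arm, object)."""
    name: str
    nodes: tuple


@dataclass(frozen=True)
class TemporalFactor:
    """A skill transition acting on some nodes at a given plan step."""
    skill: str
    step: int
    nodes: tuple


@dataclass
class FactorGraphPlan:
    spatial: list = field(default_factory=list)
    temporal: list = field(default_factory=list)

    def add_spatial(self, name, *nodes):
        self.spatial.append(SpatialFactor(name, nodes))

    def add_skill(self, skill, step, *nodes):
        self.temporal.append(TemporalFactor(skill, step, nodes))

    def shared_nodes(self, step):
        """Nodes touched by more than one skill at this step."""
        counts = Counter(
            node for f in self.temporal if f.step == step for node in f.nodes
        )
        return {node for node, count in counts.items() if count > 1}


# Two arms pick up the same pot in parallel (illustrative node names).
plan = FactorGraphPlan()
plan.add_skill("Pick", 0, "left_arm", "pot")
plan.add_skill("Pick", 0, "right_arm", "pot")
plan.add_spatial("FixedTransform", "left_arm", "right_arm")
print(plan.shared_nodes(0))  # {'pot'}
```

A node shared by two skills at the same step is the case in which both skills' effects must hold for that node simultaneously; skills with disjoint node sets can be chained independently.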

Paper Structure (20 sections, 9 equations, 17 figures, 8 tables)

This paper contains 20 sections, 9 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Factor graph for a multi-arm coordination task. Our factor graph-based planning formulation solves for a sequence of spatial factor graphs from the initial state to a goal factor by chaining them using temporal skill factors. The figure illustrates the temporal evolution of a factor graph by executing single or multiple skills sequentially or in parallel to hand over a hammer, pick up a nail, and coordinate both arms to strike the nail. Task: The task objective is to place the hammer inside the box. However, since the left arm cannot reach the box, the hammer is handed over to the right arm so that the right arm can complete the task. (a) Inputs: The initial scene and a symbolically feasible spatial-temporal factor graph plan to complete the goal objective. (b) GFC: We formulate all factors as distributions of the nodes connected to them. GFC represents spatial factors as classifiers and temporal factors as diffusion models. We leverage the compositionality of diffusion models to compose spatial-temporal distributions and find the joint distribution of the complete plan directly at inference. Finally, samples drawn from such a joint distribution are symbolically and geometrically feasible solutions of the whole plan. (c) Output: A sequence of skill choices and optimized continuous parameters executed on robots with parameterized skill controllers.
  • Figure 2: (Left) Parallel independent chaining: The figure shows the execution of two skills ($\pi_1$ and $\pi_2$) in parallel on two independent sets of nodes (L, C and R, M) to modify their existing factors (Grasped). The two independent executions can be connected via an external factor $\mu_1$ (FixedTransform) that introduces a spatial dependency between nodes C and M. (Right) Parallel dependent chaining: The figure shows overlapping nodes of interest during the parallel execution of two skills. The pot is to be picked up using both arms simultaneously. The result is the factors (Grasped) between (L, P and R, P) and an external factor $\mu_2$ (FixedTransform) between L and R. Overlapping nodes satisfy both skills' temporal effects.
  • Figure 3: Evaluation tasks: (a) Hook reach: Hook is used to pull an object in the robot's workspace followed by other skills. (b) Constrained packing: Multiple objects must be placed on a rack without collisions. (c) Rearrangement push: Hook is used to push objects to a desired arrangement followed by other skills. (d) Hammer place: A hammer must be handed over to another manipulator and placed in a target box. (e) Hammer nail: A hammer must be handed over to another manipulator and a configuration must be achieved to strike a nail. (f) Pour cup: Cups must be brought in a configuration that allows successful pouring from one to another.
  • Figure 4: Evaluating GFC on bimanual reorientation where two arms simultaneously pick and reorient a pot.
  • Figure 5: Linear chaining has limitations. Baseline methods that assume a linear chain suffer a performance drop when given inconsistent skill chains, where steps with sequential dependencies are swapped. GFC retains a high success rate using the parallel skeleton.
  • ...and 12 more figures