Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching

Eugenio Chisari, Nick Heppert, Max Argus, Tim Welschehold, Thomas Brox, Abhinav Valada

TL;DR

This paper investigates the application of Conditional Flow Matching (CFM) to robotic policy learning and studies how it interacts with the other design choices required to build an imitation learning algorithm, showing that CFM performs best when combined with point cloud input observations.

Abstract

Learning from expert demonstrations is a promising approach for training robotic manipulation policies from limited data. However, imitation learning algorithms require a number of design choices, including the input modality, the training objective, and the 6-DoF end-effector pose representation. Diffusion-based methods have gained popularity as they enable predicting long-horizon trajectories and handle multimodal action distributions. Recently, Conditional Flow Matching (CFM), or Rectified Flow, has been proposed as a more flexible generalization of diffusion models. In this paper, we investigate the application of CFM in the context of robotic policy learning and specifically study its interplay with the other design choices required to build an imitation learning algorithm. We show that CFM gives the best performance when combined with point cloud input observations. Additionally, we study the feasibility of a CFM formulation on the SO(3) manifold and evaluate its suitability with a simplified example. We perform extensive experiments on RLBench, which demonstrate that our proposed PointFlowMatch approach achieves a state-of-the-art average success rate of 67.8% over eight tasks, double the performance of the next best method.
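
To make the CFM idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the commonly used straight-line probability path between a noise sample and a target, and replaces the learned, observation-conditioned network with a hypothetical closed-form `oracle_velocity` so the example runs without training. All function names and the 3-dimensional toy "action" are illustrative assumptions.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Build one CFM regression pair for the straight-line path (an assumption).

    x0: noise sample, x1: target sample, t: scalar in [0, 1).
    Returns the interpolated point x_t and the target velocity x1 - x0.
    """
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0
    return x_t, u_t

def euler_sample(velocity_fn, x0, k):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with k Euler steps."""
    x = x0.copy()
    dt = 1.0 / k
    for i in range(k):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
x1 = np.array([0.3, -0.1, 0.5])   # hypothetical target action
x0 = rng.standard_normal(3)       # Gaussian noise sample

# Training side: a network would be regressed onto u_t at the point x_t.
x_t, u_t = cfm_training_pair(x0, x1, t=0.25)

# Inference side: stand-in for a trained network. For a single target this
# closed-form field points from the current sample straight to x1.
def oracle_velocity(x, t):
    return (x1 - x) / (1.0 - t)

x_final = euler_sample(oracle_velocity, x0, k=4)
```

With this idealized velocity field the Euler integration lands on the target after any number of steps `k`; with a learned network, `k` trades inference time against sample quality.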

Paper Structure (14 sections, 3 equations, 5 figures, 3 tables, 2 algorithms)


Figures (5)

  • Figure 1: Diffusion and CFM are repeatedly applied to a noisy trajectory, thereby iteratively yielding a clean trajectory that can be executed on the robot. The generative models also take as input encoded observations.
  • Figure 2: Example images of the eight RLBench tasks.
  • Figure 3: Comparison of CFM and DDIM for varying numbers of inference steps $k$. We compare the inference time ($\downarrow$) in [ms] and the inference FPS ($\uparrow$) in [Hz] against the overall success rate ($\uparrow$) for both formulations.
  • Figure 4: We demonstrate PointFlowMatch on a real robotic setup. We evaluate on two tasks: open box and sponge on plate.
  • Figure 5: Simplified Example. The left figure shows the edge case in which random samples are close to the opposite pole of the target sample. Here the $SO(3)$ formulation presents a discontinuity, which makes learning more difficult. In the three right figures, we visualize the mean error during inference across different sampling locations for our different formulations. We mark the target with a red cross. One observes that for the Euclidean formulation the error is lower for initial sample points along the axis orthogonal to the target. This is expected, as values sampled along the line are naturally mapped to the target when normalized. In the last figure, by contrast, one observes higher errors close to the pole. Additionally, a training data bias is visible, as the error is higher on one side of the discontinuity.
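
The normalization remark in the Figure 5 caption can be illustrated with a small sketch. It is one possible reading, assuming the simplified example represents samples as unit-norm vectors and that the Euclidean formulation projects its output back by dividing by the norm; the `normalize` helper and the concrete target vector are hypothetical.

```python
import numpy as np

def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    return v / np.linalg.norm(v)

target = normalize(np.array([1.0, 2.0, 2.0]))  # hypothetical unit-norm target

# Any positive multiple of the target lies on the same ray from the origin,
# so normalization maps it exactly onto the target.
scales = [0.1, 0.5, 3.0]
projected = [normalize(s * target) for s in scales]

# A negative multiple lies on the opposite side and maps to the antipode,
# which corresponds to the "opposite pole" edge case in the caption.
antipodal = normalize(-0.5 * target)
```

Under this reading, points on the ray through the target need no correction in direction, whereas points near the antipode are as far from the target as possible on the sphere.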