Frame Mining: a Free Lunch for Learning Robotic Manipulation from 3D Point Clouds

Minghua Liu, Xuanlin Li, Zhan Ling, Yangyan Li, Hao Su

TL;DR

Frame mining demonstrates that the coordinate frame used to represent input point clouds can dramatically affect learning efficiency and policy quality in 3D robotic manipulation. The authors introduce FrameMiners, especially FrameMiner-MixAction, which adaptively fuse multiple frames (e.g., end-effector, world, target-part) using frame-specific experts and input-dependent weights, matching or exceeding single-frame baselines across five tasks in both RL and IL settings. They show that the end-effector and target-part frames often yield better sample efficiency, while fusion across frames adds robustness and improves multi-arm tasks; real-world experiments validate sim-to-real applicability with modest domain transfer. The work argues that a free-lunch-like improvement can be obtained without extra cameras, simply through smarter frame normalization and fusion, with practical implications for deploying point-cloud policies on existing robotic systems.
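The fusion idea in the summary above (frame-specific experts combined with input-dependent weights) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `softmax` and `fuse_actions` are hypothetical, each frame is given a single scalar weight, and the weights are supplied directly rather than predicted by a network. The authors' FrameMiner-MixAction may differ in how weights are produced and applied.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_actions(expert_actions, weight_logits):
    """Blend per-frame action proposals with softmax-normalized weights.

    expert_actions: one action vector per candidate frame (hypothetical experts).
    weight_logits: one unnormalized score per frame; in a learned system these
        would depend on the input, here they are passed in by hand.
    """
    weights = softmax(weight_logits)
    dim = len(expert_actions[0])
    return [sum(w * a[i] for w, a in zip(weights, expert_actions))
            for i in range(dim)]

# Two hypothetical experts (e.g., end-effector frame and world frame).
proposals = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_actions(proposals, [0.0, 0.0])  # equal scores give a plain average
```

With equal scores the fused action is the mean of the proposals; raising one frame's score shifts the result toward that frame's expert.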

Abstract

We study how choices of input point cloud coordinate frames impact learning of manipulation skills from 3D point clouds. There exist a variety of coordinate frame choices to normalize captured robot-object-interaction point clouds. We find that different frames have a profound effect on agent learning performance, and the trend is similar across 3D backbone networks. In particular, the end-effector frame and the target-part frame achieve higher training efficiency than the commonly used world frame and robot-base frame in many tasks, intuitively because they provide helpful alignments among point clouds across time steps and thus can simplify visual module learning. Moreover, the well-performing frames vary across tasks, and some tasks may benefit from multiple frame candidates. We thus propose FrameMiners to adaptively select candidate frames and fuse their merits in a task-agnostic manner. Experimentally, FrameMiners achieves on-par or significantly higher performance than the best single-frame version on five fully physical manipulation tasks adapted from ManiSkill and OCRTOC. Without changing existing camera placements or adding extra cameras, point cloud frame mining can serve as a free lunch to improve 3D manipulation learning.
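To make the alignment intuition in the abstract concrete, the sketch below re-expresses a world-frame point in an end-effector frame. It is an illustration under stated assumptions, not the authors' code: the helper names `rot_z` and `world_to_frame` and all poses and coordinates are made up, and the frame pose is assumed to be given as a rotation matrix plus an origin in world coordinates.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z-axis (row-major nested lists)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def world_to_frame(points, rotation, origin):
    """Express world-frame points in a frame with pose (rotation, origin).

    The columns of `rotation` are the frame's axes in world coordinates,
    so each point is mapped as p_frame = R^T (p_world - origin).
    """
    out = []
    for p in points:
        d = [p[i] - origin[i] for i in range(3)]
        out.append([sum(rotation[k][j] * d[k] for k in range(3))
                    for j in range(3)])
    return out

# Hypothetical trajectory: a grasped handle point stays 0.1 m ahead of the
# gripper along the gripper's x-axis while the gripper moves and turns.
step0 = world_to_frame([[0.6, 0.0, 0.3]], rot_z(0.0), [0.5, 0.0, 0.3])[0]
step1 = world_to_frame([[0.2, 0.5, 0.3]], rot_z(math.pi / 2), [0.2, 0.4, 0.3])[0]
```

The handle point has different world coordinates at the two time steps, yet maps to approximately the same end-effector-frame coordinates (0.1, 0, 0) in both. This is the kind of cross-time-step alignment the abstract credits for simplifying visual module learning.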

Paper Structure (25 sections, 17 figures, 4 tables)

Figures (17)

  • Figure 1: A 3D point cloud of a dual-arm robot pushing a chair, which can be represented in various coordinate frames without changing camera placements or requiring extra camera views. Our FrameMiner takes as input a point cloud represented in multiple candidate frames and adaptively fuses their merits, resulting in better performance.
  • Figure 2: We study coordinate frame mining on manipulation tasks adapted from OCRTOC [liu2021ocrtoc] and ManiSkill [mu2021maniskill], covering various setups (e.g., number of arms, mobility, camera). Simulation is fully physical.
  • Figure 3: Architecture of a 3D point cloud-based agent, which is optimized by actor-critic RL algorithms. We study coordinate frame selection of input (fused) point cloud.
  • Figure 4: Illustration of four coordinate frames, which provide different alignments across time steps. We visualize three point clouds (three time steps) of an OpenCabinetDoor trajectory. Each row shows the same point cloud represented in different coordinate frames. Please zoom in for details. Robot arm, cabinet door handle, cabinet door, and cabinet body are colored in blue, red, yellow, and brown, respectively. RGB arrows indicate the corresponding origin and axes for each frame. Since the point clouds used for policy learning can be rather sparse, we show dense point clouds here for better visualization.
  • Figure 5: Comparison of four coordinate frames on five fully-physical manipulation tasks. The (fused) point cloud is transformed to a single coordinate frame before being fed to the visual backbone network. For dual-arm tasks (i.e., PushChair and MoveBucket), we use the right-hand frame as the end-effector frame. For PickObject, which has a fixed base, the world frame is the same as the robot-base frame. Mean and standard deviation over 5 seeds are shown.
  • ...and 12 more figures