Coarse-to-Fine 3D Keyframe Transporter

Xupeng Zhu, David Klee, Dian Wang, Boce Hu, Haojie Huang, Arsh Tangri, Robin Walters, Robert Platt

TL;DR

The paper introduces a Coarse-to-Fine 3D Keyframe Transporter that exploits bi-equivariant symmetry in Keyframe Imitation Learning to efficiently learn SE(3) actions for manipulation. By replacing 2D cross-correlation with a 3D cross-correlation framework and implementing an SE(3) coarse-to-fine action evaluator, it achieves strong sample efficiency and broad task coverage, including pushing, turning, and tool use. Key contributions include a bi-equivariant policy formulation, 3D cross-correlation-based action inference, in-hand segmentation, and a multi-level coarse-to-fine (C2F) scheme that significantly reduces computation. Empirical results on RLBench and real-world tasks demonstrate substantial performance gains with limited demonstrations, highlighting the method’s practical value for data-efficient robotic manipulation.

Abstract

Recent advances in Keyframe Imitation Learning (IL) have enabled learning-based agents to solve a diverse range of manipulation tasks. However, most approaches ignore the rich symmetries in the problem setting and, as a consequence, are sample-inefficient. This work identifies and utilizes the bi-equivariant symmetry within Keyframe IL to design a policy that generalizes to transformations of both the workspace and the objects grasped by the gripper. We make two main contributions: First, we analyze the bi-equivariance properties of the keyframe action scheme and propose a Keyframe Transporter derived from Transporter Networks, which evaluates actions using cross-correlation between the features of the grasped object and the features of the scene. Second, we propose a computationally efficient coarse-to-fine SE(3) action evaluation scheme for reasoning about the intertwined translation and rotation actions. The resulting method outperforms strong Keyframe IL baselines by an average of >10% on a wide range of simulation tasks, and by an average of 55% in 4 physical experiments.
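
The cross-correlation idea in the abstract can be illustrated with a toy planar example. This is only a sketch under strong assumptions, not the paper's implementation: the "features" are raw binary occupancy grids rather than learned 3D features, rotations are limited to four 90-degree steps in the plane instead of SE(3), and the names `place_values`, `best_action`, and `L_SHAPE` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def place_values(scene, in_hand, n_rot=4):
    """Return Q[r, y, x]: correlation of `scene` with `in_hand` rotated by
    r * 90 degrees, with the rotated template's top-left corner at (y, x)."""
    k = in_hand.shape[0]  # assumes a square k x k template
    windows = sliding_window_view(scene, (k, k))  # shape (H-k+1, W-k+1, k, k)
    return np.stack([
        np.einsum("yxij,ij->yx", windows, np.rot90(in_hand, r))
        for r in range(n_rot)
    ])


def best_action(q):
    """Index (rotation, y, x) of the highest-valued action."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(q), q.shape))


# Toy "in-hand features": a binary L-shaped object (stand-in for learned features).
L_SHAPE = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [1, 1, 0]], dtype=float)

# Toy "scene features": an L-shaped slot rotated by one 90-degree step,
# occupying rows 2-4 and columns 4-6 of an 8 x 8 grid.
scene = np.zeros((8, 8))
scene[2:5, 4:7] = np.rot90(L_SHAPE, 1)

action = best_action(place_values(scene, L_SHAPE))
```

In this toy setup `action` is `(1, 2, 4)`: one rotation step, placed at row 2, column 4. If the in-hand object is itself pre-rotated by one step (`np.rot90(L_SHAPE, 1)`), the best rotation index drops to 0 while the position is unchanged, which mirrors the compensation behaviour that bi-equivariance asks of the policy.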

Paper Structure

This paper contains 15 sections, 8 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The place module of Transporter Networks [zeng2021transporter], along with follow-up works [Huang-RSS-22, ryu2023equivariant, huang2024fourier], achieves bi-equivariance in the place policy (e.g., picking an "L" shape and placing it in an "L"-shaped receptacle) by performing cross-correlation between the scene features $f_s$ and the in-hand features $f_{ih}$. In the computed action value map $Q_\text{place}$, the height and width represent the X and Y translations of the gripper, while the channels correspond to different gripper rotations. Therefore, $Q_\text{place}$ densely evaluates every combined translation-rotation action.
  • Figure 2: Bi-equivariance in keyframe policies. Second column: given a scene, the policy $\pi$ prescribes an optimal action $a$. First column: if the scene is rotated by $g_1$, the optimal action should also be rotated: $g_1a$. Third column: if the in-hand object is rotated by $g_2$, the optimal action should pre-rotate to compensate: $ag_2^{-1}$.
  • Figure 3: The Coarse-to-Fine 3D Keyframe Transporter performs inference in two steps. Left: in step 1, the in-hand features $s_{ih}$ are obtained by cropping the scene features $s$ and transforming them into the gripper frame. Then the $\mathop{\mathrm{key}}\nolimits$ and $\mathop{\mathrm{query}}\nolimits$ U-Nets map the observations $s$ and $s_{ih}$ into pyramids of latent features $f_s^l$ and $f_{ih}^l$, respectively. Middle: in step 2, the action values $Q_\text{T}^l: \hat{G}_l \rightarrow \mathbb{R}$ are computed through a coarse-to-fine cross-correlation between the latent scene features $f_s^l$ and the in-hand features $f_{ih}^l$. At the coarse level, the evaluated actions cover a wide translational-rotational range on a coarse grid. At the fine level, the translational-rotational range narrows, but the finer resolution allows precise action evaluation. Lastly, the gripper open-close and planner collision actions are evaluated by an MLP using the features from the $\mathop{\mathrm{key}}\nolimits$ U-Net.
  • Figure 4: Visualization of learned in-hand segmentation.
  • Figure 5: (a) Four of the 18 RLBench tasks [james2020rlbench]. (b) When the 18 tasks are grouped by equivariance type, our method has an advantage on bi-equivariant and mixed-equivariance tasks but underperforms RVT on equivariant tasks. (c) "Bi-equ.": the top 5 tasks. "Mix Bi-equ./Equ.": the middle 9 tasks. "Equ.": the bottom 3 tasks.
  • ...and 1 more figure
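
The coarse-to-fine evaluation described in the Figure 3 caption can be sketched as a multi-level grid search. This is an illustrative sketch under stated assumptions, not the paper's method: the action is a planar pose (x, y, yaw) rather than a full SE(3) pose, yaw is treated as a plain interval without wraparound, and the action values come from a hypothetical analytic `score` function instead of cross-correlating learned feature pyramids. The function name `coarse_to_fine_argmax` and all parameters are assumptions made for the example.

```python
import numpy as np


def coarse_to_fine_argmax(score, center, half_range, n=5, levels=3):
    """Refine an (x, y, yaw) estimate by evaluating an n^3 grid around the
    current best action at each level, then shrinking the search window."""
    center = np.asarray(center, dtype=float)
    half_range = np.asarray(half_range, dtype=float)
    n_evals = 0
    for _ in range(levels):
        # Regular grid covering [center - half_range, center + half_range] per axis.
        axes = [np.linspace(c - h, c + h, n) for c, h in zip(center, half_range)]
        grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
        values = score(grid)  # one value per candidate action
        n_evals += len(grid)
        center = grid[np.argmax(values)]
        # The next window reaches one grid cell on each side of the current best.
        half_range = 2.0 * half_range / (n - 1)
    return center, n_evals


# Hypothetical stand-in for the learned action-value function: a single smooth peak.
target = np.array([0.13, -0.27, 0.9])


def score(actions):
    return -np.sum((actions - target) ** 2, axis=1)


start_half_range = np.array([0.5, 0.5, np.pi])
estimate, n_evals = coarse_to_fine_argmax(score, np.zeros(3), start_half_range)
```

With `n=5` and `levels=3`, the search evaluates 3 × 5³ = 375 candidate actions. A single dense grid with the same final spacing (one eighth of the initial half-range) would need 17 points per axis, or 17³ = 4913 evaluations, which is the kind of saving the coarse-to-fine scheme is aiming for. The toy relies on the score having a single peak; with several competing peaks, a coarse level could commit to the wrong region.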