GOMP: Grasped Object Manifold Projection for Multimodal Imitation Learning of Manipulation

William van den Bogert, Gregory Linkowski, Nima Fazeli

TL;DR

This paper tackles compounding errors in imitation learning (IL) for high-precision manipulation by introducing Grasped Object Manifold Projection (GOMP), which constrains a non-rigidly grasped object to a learned low-dimensional task manifold derived from expert demonstrations. GOMP couples diffusion-based IL with an interactive 7-arm bandit that selects the projection dimensionality onto the task manifold, reducing error accumulation and improving robustness across four precise assembly tasks that use tactile feedback. The task space is derived with principal geodesic analysis (PGA), a generalization of PCA to manifolds, and observations are encoded from tactile and proprioceptive signals. Results show consistent improvements over vanilla diffusion-based IL in nut threading, peg insertion, USB insertion, and battery cover placement. The method is modality-agnostic and aims to enable fixtureless, high-precision robotic assembly by leveraging geometry-driven constraints on grasped objects.

Abstract

Imitation Learning (IL) holds great potential for learning repetitive manipulation tasks, such as those in industrial assembly. However, its effectiveness is often limited by insufficient trajectory precision due to compounding errors. In this paper, we introduce Grasped Object Manifold Projection (GOMP), an interactive method that mitigates these errors by constraining a non-rigidly grasped object to a lower-dimensional manifold. GOMP assumes a precise task in which a manipulator holds an object that may shift within the grasp in an observable manner and must be mated with a grounded part. Crucially, all GOMP enhancements are learned from the same expert dataset used to train the base IL policy, and are adjusted with an n-arm bandit-based interactive component. We propose a theoretical basis for GOMP's improvement upon the well-known compounding error bound in IL literature. We demonstrate the framework on four precise assembly tasks using tactile feedback, and note that the approach remains modality-agnostic. Data and videos are available at williamvdb.github.io/GOMPsite.
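The n-arm bandit-based interactive component mentioned in the abstract can be illustrated with a minimal sketch. The paper excerpt here does not specify the bandit algorithm, so the epsilon-greedy rule, the sample-average update, the binary success reward, and all names below (`EpsilonGreedyBandit`, `run_trials`, `rollout`) are assumptions made for illustration, not the authors' implementation. Each arm stands for one candidate projection setting, and the value estimates $Q(i)$ start from per-arm priors.

```python
import random


class EpsilonGreedyBandit:
    """Hypothetical n-armed bandit with sample-average value estimates Q(i)."""

    def __init__(self, priors, epsilon=0.1, seed=0):
        self.q = list(priors)              # Q(i), initialised from per-arm priors
        self.counts = [1] * len(self.q)    # each prior counts as one pseudo-observation
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        """Explore with probability epsilon, otherwise pick the highest Q(i)."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q))
        return max(range(len(self.q)), key=lambda i: self.q[i])

    def update(self, arm, reward):
        """Incremental sample-average update of Q(arm) after one rollout."""
        self.counts[arm] += 1
        self.q[arm] += (reward - self.q[arm]) / self.counts[arm]


def run_trials(bandit, rollout, n_trials):
    """Run n_trials rollouts; rollout(arm) returns a scalar reward (assumed)."""
    for _ in range(n_trials):
        arm = bandit.select()
        bandit.update(arm, rollout(arm))
    return bandit.q
```

In this sketch, `rollout` would wrap one policy execution on the robot and return, for example, 1.0 on task success and 0.0 otherwise; that reward definition is likewise an assumption.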

Paper Structure

This paper contains 27 sections, 17 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: An overview of Grasped Object Manifold Projection as implemented in this paper. Vision-based tactile sensors (GelSlim 4.0) provide a field of shear displacements and raw RGB images. Shear-fields and proprioception serve as modalities for Diffusion Policy (DP), while RGB images are used for in-hand object pose estimation ($\mathbf{IHP}$). DP trajectories and object poses are used to project robot-driven grasped object behavior to the task space $\mathcal{T}$, derived from principal geodesic analysis (PGA) of the expert dataset. A 7-armed bandit adjusts this projection based on rollout rewards.
  • Figure 2: a) Projection loss priors (from Eq. \ref{eq:proj_loss}) derived from the dataset of each task tested in Section \ref{section:results}. b) Projection of the object and robot trajectories (as calculated in Section \ref{rollout}) in the expert dataset of nut threading along the manifold determined by Eq. \ref{eq:proj_param}, in this case $i^*_k=2$. c) Diffused vs. projected actions visualized for peg insertion and USB insertion alongside the current object pose observation. Here, the action is visualized as the last point in the action trajectory.
  • Figure 3: The tactile and ground-truth $\mathrm{SE}(2)$ object pose data for in-hand pose estimation is collected while the object is grasped between the tactile sensors, and a human manually moves the object in the grasp. The grasp occasionally opens during this collection. Ground-truth object pose data comes from AprilTag registration using a RealSense D435 camera.
  • Figure 4: View of manual demonstration as described in Section \ref{subsection:demo_collection3}, and the pipeline of the tactile and proprioceptive data toward the generation of the combined policy $\hat{\pi}_{\mathbf{s}_o}$.
  • Figure 5: Nut threading results. Top: Handoff, initialization, policy, and success snapshots. Left: GOMP vs DP performance results as demonstrations are added to training. Right: Change of highest values in $Q(i)$ from Section \ref{interactive} over the 60-trial horizon, averaged over the 4 runs of GOMP. Filled area surrounding curves represents $\sigma$ (the initial projection loss in Eq. \ref{eq:proj_loss} sometimes yielded $i_k^*=1$ but $i_k^*=2$ was the eventual result in all 4 runs of USB insertion).
  • ...and 3 more figures
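
The projection to a task space described in the Figure 1 and Figure 2 captions can be sketched in a simplified setting. The paper derives $\mathcal{T}$ with principal geodesic analysis on pose data; the sketch below instead uses ordinary Euclidean PCA as a stand-in, which is an assumption made to keep the example short, and it is not the paper's Eq. `eq:proj_param` or `eq:proj_loss`. The function names `fit_subspace` and `project`, and the toy data, are hypothetical.

```python
import numpy as np


def fit_subspace(demos, k):
    """Fit a k-dimensional affine subspace to demonstration points.

    demos: array of shape (n_samples, n_dims).
    Returns the mean and a (k, n_dims) matrix of orthonormal principal directions.
    Euclidean PCA is used here as a simplified stand-in for PGA.
    """
    mean = demos.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(demos - mean, full_matrices=False)
    return mean, vt[:k]


def project(x, mean, basis):
    """Orthogonally project x onto the affine subspace mean + span(basis rows)."""
    return mean + (x - mean) @ basis.T @ basis


# Toy demonstrations lying on the line y = 2x (illustrative data only).
demos = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
mean, basis = fit_subspace(demos, k=1)

# A perturbed point is pulled back onto the line spanned by the demonstrations.
projected = project(np.array([3.5, 2.0]), mean, basis)
```

Here `k` plays the role of the projection dimensionality that the bandit adjusts; a smaller `k` constrains the action more tightly to the directions the expert data actually varies along.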