Discovering Robotic Interaction Modes with Discrete Representation Learning

Liquan Wang, Ankit Goyal, Haoping Xu, Animesh Garg

TL;DR

This paper presents ActAIM2, which learns a discrete representation of robot manipulation interaction modes in a purely unsupervised fashion, without the use of expert labels or simulator-based privileged information, using novel data collection methods involving simulator rollouts.

Abstract

Human actions manipulating articulated objects, such as opening and closing a drawer, can be categorized into multiple modalities we define as interaction modes. Traditional robot learning approaches lack discrete representations of these modes, which are crucial for empirical sampling and grounding. In this paper, we present ActAIM2, which learns a discrete representation of robot manipulation interaction modes in a purely unsupervised fashion, without the use of expert labels or simulator-based privileged information. Utilizing novel data collection methods involving simulator rollouts, ActAIM2 consists of an interaction mode selector and a low-level action predictor. The selector generates discrete representations of potential interaction modes with self-supervision, while the predictor outputs corresponding action trajectories. Our method is validated through its success rate in manipulating articulated objects and its robustness in sampling meaningful actions from the discrete representation. Extensive experiments demonstrate ActAIM2's effectiveness in enhancing manipulability and generalizability over baselines and ablation studies. For videos and additional results, see our website: https://actaim2.github.io/.
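As an illustration of what sampling from a discrete mode representation could look like, the sketch below first draws a mode index from a mixture distribution and then draws a continuous embedding around that mode. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sample_interaction_mode`, the diagonal Gaussian components, and all numeric values are hypothetical.

```python
import random


def sample_interaction_mode(weights, means, stds, rng):
    """Sample (mode index, embedding) from a diagonal Gaussian mixture.

    All parameters are hypothetical placeholders for this sketch.
    """
    # Pick the discrete mode index in proportion to the mixture weights.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Draw a continuous embedding around the chosen component's mean.
    eps = [rng.gauss(m, s) for m, s in zip(means[k], stds[k])]
    return k, eps


rng = random.Random(0)
weights = [0.5, 0.3, 0.2]                       # assumed mixture weights
means = [[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # assumed component means
stds = [[0.1, 0.1]] * 3                         # assumed per-dimension std devs
mode_index, mode_embedding = sample_interaction_mode(weights, means, stds, rng)
```

In a pipeline like the one the abstract describes, an embedding of this kind would be passed to the action predictor; here the mixture parameters are fixed by hand purely for illustration.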

Paper Structure

This paper contains 64 sections, 15 equations, 19 figures, 4 tables, 1 algorithm.

Figures (19)

  • Figure 1: ActAIM2 identifies meaningful interaction modes, such as opening and closing drawers, from RGB-D images of articulated objects and robots. It represents these modes as discrete clusters of embeddings and trains a policy to generate control actions for each cluster-based interaction.
  • Figure 2: (a) GMM Model Selector: The mode selector, a generative model, processes the differences between the initial and final image visual embeddings as generated data, using the initial image embeddings as the conditional variable. (b) Behavior Cloning Action Predictor: The interaction mode $\epsilon$ is sampled from the latent embedding space of the mode selector. Five multiview RGB-D observations from the circled cameras are back-projected and fused into a colored point cloud to render novel views. The rendered image tokens and the interaction mode token are concatenated and fed through a multiview transformer to predict the action $a =(\mathbf{p}, \mathbf{R}, \mathbf{q})$.
  • Figure 3: Given different task embeddings, the action predictor produces actions representing distinct interaction modes. Here, we visualize the camera view and the prediction heatmap from the top for object instances. The first row shows heatmaps for pushing and pulling the handle, while the second row shows heatmaps for closing the left or right door. For more qualitative results, see the appendix.
  • Figure 4: The figure illustrates the drawer manipulation task conducted by the Kinova Gen2 robot arm. The task involves interacting with a three-drawer shelf, starting from an initial half-open state (center). The robot executes two modes: opening (left) and closing (right) the drawers, with arrows showing the gripper's movement direction during each interaction.
  • Figure 5: (a) Real World Setup: The real-world experiment uses a single RGB-D camera (Azure Kinect) to capture visual information and a Kinova robot equipped with a parallel-jaw gripper to interact with an articulated shelf object. (b) Shelf Object Point Cloud Illustration: The point cloud of the shelf object with three movable drawers, extracted from the RGB-D camera.
  • ...and 14 more figures
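
The Figure 2(b) caption mentions back-projecting RGB-D observations into a colored point cloud. The sketch below shows the standard pinhole back-projection for a single view, assuming hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and a depth image given as nested lists in meters. It is an illustrative sketch, not the authors' code, and it leaves points in the camera frame; fusing several views would additionally require each camera's pose.

```python
def backproject_depth(depth, colors, fx, fy, cx, cy):
    """Back-project a depth image into colored 3D points (camera frame).

    depth  : rows of depth values in meters (0 or negative = invalid pixel)
    colors : rows of per-pixel colors, same layout as depth
    fx, fy, cx, cy : assumed pinhole intrinsics, in pixels
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth reading
            # Invert the pinhole projection u = fx * x / z + cx (same for v).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append(((x, y, z), colors[v][u]))
    return points


# Tiny 2x2 example with made-up values.
depth = [[1.0, 0.0],
         [2.0, 2.0]]
colors = [[(255, 0, 0), (0, 0, 0)],
          [(0, 255, 0), (0, 0, 255)]]
cloud = backproject_depth(depth, colors, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Each entry of `cloud` pairs an `(x, y, z)` position with the color of the pixel it came from, which is the information a renderer needs to produce novel views.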