Raising Body Ownership in End-to-End Visuomotor Policy Learning via Robot-Centric Pooling

Zheyu Zhuang, Ville Kyrki, Danica Kragic

TL;DR

Robot-centric Pooling (RcP) significantly enhances the policies’ robustness against various unseen distractors, including self-distractors, positioned at different locations, and makes the learnt policy far more resilient to aggressive pixel shifts than the baselines.

Abstract

We present Robot-centric Pooling (RcP), a novel pooling method designed to enhance end-to-end visuomotor policies by enabling differentiation between the robots and similar entities or their surroundings. Given an image-proprioception pair, RcP guides the aggregation of image features by highlighting image regions correlating with the robot's proprioceptive states, thereby extracting robot-centric image representations for policy learning. Leveraging contrastive learning techniques, RcP integrates seamlessly with existing visuomotor policy learning frameworks and is trained jointly with the policy using the same dataset, requiring no extra data collection involving self-distractors. We evaluate the proposed method with reaching tasks in both simulated and real-world settings. The results demonstrate that RcP significantly enhances the policies' robustness against various unseen distractors, including self-distractors, positioned at different locations. Additionally, the inherent robot-centric characteristic of RcP enables the learnt policy to be far more resilient to aggressive pixel shifts compared to the baselines.
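
As a rough illustration of the pooling idea described in the abstract, the sketch below treats the proprioceptive state as a query over image feature locations, in the style of cross-attention. This is an assumption made for illustration, not the authors' architecture: the function name `robot_centric_pool`, the projection matrices `w_query`, `w_key` and `w_value`, the scaled dot-product score, and all dimensions are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def robot_centric_pool(feature_map, proprio, w_query, w_key, w_value):
    """Pool an image feature map under guidance of a proprioceptive state.

    feature_map: (H, W, C) image features from some encoder (assumed).
    proprio:     (D,) proprioceptive state, e.g. joint angles (assumed).
    Returns a context vector and an (H, W) map of alignment scores.
    """
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)      # one token per location

    query = proprio @ w_query                   # (d,)
    keys = tokens @ w_key                       # (H*W, d)
    values = tokens @ w_value                   # (H*W, d_v)

    # Alignment between every image location and the proprioceptive query.
    # Scaled dot-product scoring is an illustrative choice.
    logits = keys @ query / np.sqrt(keys.shape[1])
    scores = softmax(logits)                    # (H*W,), sums to 1

    # Weighted sum of image values gives the pooled representation.
    context = scores @ values                   # (d_v,)
    return context, scores.reshape(h, w)


# Toy usage with random weights (all sizes are hypothetical).
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8, 32))
proprio = rng.normal(size=7)
w_query = rng.normal(size=(7, 16))
w_key = rng.normal(size=(32, 16))
w_value = rng.normal(size=(32, 16))

context, alignment = robot_centric_pool(
    feature_map, proprio, w_query, w_key, w_value
)
```

Because the scores depend on the proprioceptive input, feeding a different state for the same image yields a different score map and a different pooled vector, which is the behaviour the abstract attributes to robot-centric representations.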

Paper Structure

This paper contains 14 sections, 12 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Body ownership via Robot-centric Pooling. RcP enables a conventional policy regression baseline to foster self-recognition and the ability to distinguish self from others. (a): A testing sample including a self-distractor (right). (b): Image-Proprioception Alignment (IPA) scores for the self-state, $\boldsymbol{p^+}$. (c): Image saliency map [srinivas2019fullgrad] with $\boldsymbol{p^+}$ based on the regressed policy (warmer colours indicate higher relevance). (d): IPA scores for the distractor's state, $\boldsymbol{p^-}$. (e): Saliency map with $\boldsymbol{p^-}$. (f): Saliency map from the Spatial-Softmax [finn2016_spatial_encoder_vs] baseline.
  • Figure 2: System Overview and the Robot-centric Pooling Module. (a): Robot-centric Pooling extracts the most relevant feature corresponding to the identified self for the regression task. (b): RcP computes Image-Proprioception Alignment (IPA) scores from an image-proprioception pair $(\boldsymbol{x}, \boldsymbol{p})$ and aggregates image values accordingly to create a context vector for contrastive learning and image representation in the regression pipeline.
  • Figure 3: Illustration of the Contrastive Learning Framework. (a): Similar to the pipeline proposed in MoCo [he2020_moco], the augmented images are separately encoded by RcP's image encoder and its momentum-averaged copy. (b): For each image, the manipulator region is cropped and pasted onto two random backgrounds at random spatial locations (solid green arrows). A self-distractor is cropped from a random image drawn from the training dataset and randomly pasted onto one of the augmented images (dashed red arrow).
  • Figure 4: Illustration of Simulated Experiments, Input Saliency Maps, and Real Experiments. (a): Three distractors (the self-distractor, a Franka Panda robot, and a static object, a pot plant) are positioned at four distinct locations: behind the robot towards the left, behind the robot towards the right, alongside the robot, and in front of the robot. During the experiments, both the self-distractor and the Franka Panda execute random actions. (b): We employ an image saliency visualisation tool, FullGrad [srinivas2019fullgrad], to visualise the activated image regions for different policies. RcP: Robot-centric Pooling, SSM: Spatial-Softmax. (c): The real-world setup features a second-person camera view, with the robot dominant on the left side of the image. A movable divider conceals or reveals the distractor robot against a less-structured background, depending on the scenario. The rolled-shift image is used for testing the networks' robustness against image shifts.
  • Figure 5: (a): Translation error in the presence of a self-distractor (50 target poses). The translation error is measured at the robot's tool centre point upon convergence, comparing trajectories aimed at the same target with and without a self-distractor. (b): Reaching success vs. percentage of pixel shift. UR5 multi-instance reaching experiment (left) and Franka single-instance reaching task (right).
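
The rolled-shift test in Figure 4(c) and the pixel-shift percentages in Figure 5(b) can be pictured as a circular image shift. The sketch below assumes the shift is horizontal, wraps around the image border, and is expressed as a fraction of the image width; the captions do not state these details, and `roll_shift` is a hypothetical helper name.

```python
import numpy as np


def roll_shift(image, fraction, axis=1):
    """Circularly shift an image by a fraction of its size along `axis`.

    image:    array of shape (H, W) or (H, W, C).
    fraction: shift as a fraction of the size along `axis`
              (e.g. 0.25 for a 25% shift); assumed convention.
    axis:     1 shifts along the width, 0 along the height.
    """
    shift = int(round(fraction * image.shape[axis]))
    # Pixels pushed past one border re-enter from the opposite border.
    return np.roll(image, shift, axis=axis)


# Toy usage: a 3x4 single-channel "image" shifted by 25% of its width.
image = np.arange(12).reshape(3, 4)
shifted = roll_shift(image, 0.25)
```

A roll keeps every pixel in the frame, so the robot stays fully visible while its image location changes; a policy that depends on absolute pixel positions would be expected to degrade under such a shift.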