Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

Jiaming Zhou, Teli Ma, Kun-Yu Lin, Zifan Wang, Ronghe Qiu, Junwei Liang

TL;DR

The paper addresses the human-robot domain discrepancy that hinders the transfer of visual representations pre-trained on human data to robotic manipulation. It introduces HR-Align, a parameter-efficient adaptation method that uses paired human-robot demonstrations and a contrastive alignment loss to align semantic representations, with lightweight adapters inserted into frozen pre-trained backbones. The method shows consistent improvements across 20 simulated tasks and 5 real-world tasks, in both single-task and language-conditioned multi-task settings, raising the average success rate over the unadapted pre-trained models by more than 7%. The work demonstrates that explicit semantic alignment via paired data can preserve the versatility of pre-trained models while enabling effective robot-domain adaptation, offering a scalable approach to cross-domain visual pre-training for robotics.
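To make "lightweight adapters inserted into frozen pre-trained backbones" concrete, here is a minimal sketch in plain Python. It assumes a residual bottleneck adapter with a zero-initialized up-projection, which is a common parameter-efficient design but is not confirmed by this summary as the paper's architecture. `BottleneckAdapter` and `frozen_backbone` are hypothetical names introduced only for illustration.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + W_up * relu(W_down * x)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights (the trainable part).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection: zeros, so the adapter starts as an identity map
        # and the pre-trained features are untouched before training.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = relu(matvec(self.down, x))
        delta = matvec(self.up, hidden)
        return [a + b for a, b in zip(x, delta)]

    def num_params(self):
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))


def frozen_backbone(frame):
    """Hypothetical stand-in for a frozen pre-trained encoder."""
    return [0.5 * v for v in frame]


adapter = BottleneckAdapter(dim=4, bottleneck=2)
frame = [1.0, -2.0, 3.0, 0.5]
features = frozen_backbone(frame)   # backbone weights are never updated
adapted = adapter(features)         # only adapter weights would be trained
```

Because the up-projection starts at zero, `adapted` equals `features` before any training, so adaptation begins from the pre-trained representation. Only the adapter's few weights would need gradients.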

Abstract

Learning generalizable visual representations across different embodied environments is essential for effective robotic manipulation in real-world scenarios. However, the limited scale and diversity of robot demonstration data pose a significant challenge. Recent research has explored leveraging large-scale human activity data for pre-training, but the substantial morphological differences between humans and robots introduce a significant human-robot domain discrepancy, hindering the generalization of these models to downstream manipulation tasks. To overcome this, we propose a novel adaptation paradigm that leverages readily available paired human-robot video data to bridge the domain gap. Our method employs a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robot domain in a parameter-efficient manner. Experiments on 20 simulated tasks across two different benchmarks and five real-world tasks demonstrate significant improvements. These results span both single-task and language-conditioned multi-task settings, evaluated using two different pre-trained models. Compared to existing pre-trained models, our adaptation method improves the average success rate by over 7% across multiple tasks on both simulated benchmarks and real-world evaluations.
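The abstract does not spell out the form of the human-robot contrastive alignment loss. As a rough sketch, the code below assumes an InfoNCE-style objective over a batch of paired embeddings: each robot-video embedding is pulled toward the embedding of its paired human video and pushed away from the other human videos in the batch. The function name, the temperature value, and the toy embeddings are illustrative assumptions, not the paper's implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_alignment_loss(human, robot, temperature=0.1):
    """InfoNCE-style loss; human[i] and robot[i] are assumed to be a pair."""
    human = [l2_normalize(h) for h in human]
    robot = [l2_normalize(r) for r in robot]
    total = 0.0
    for i, r in enumerate(robot):
        # Cosine similarity of this robot embedding to every human embedding.
        logits = [dot(r, h) / temperature for h in human]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the paired human video as the positive.
        total += log_denom - logits[i]
    return total / len(robot)


# Toy 2-D embeddings: two human videos and their paired robot videos.
human_emb = [[1.0, 0.0], [0.0, 1.0]]
robot_aligned = [[0.9, 0.1], [0.1, 0.9]]      # close to the paired human video
robot_misaligned = [[0.1, 0.9], [0.9, 0.1]]   # close to the wrong human video

loss_aligned = contrastive_alignment_loss(human_emb, robot_aligned)
loss_misaligned = contrastive_alignment_loss(human_emb, robot_misaligned)
```

In this toy setup the aligned batch gives the lower loss. That is the signal that would drive a trainable adapter toward robot features that match the semantics of the paired human videos.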

Paper Structure (16 sections, 5 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: An overview of the proposed Human-Robot Semantic Alignment method. Given paired human-robot videos, the pre-trained models are efficiently adapted on the robot data to learn semantics aligned with those in human data.
  • Figure 2: Left: our real-world experimental workspace. Top right: illustrations of the five real-world tasks. Bottom right: results on the five tasks for the pre-trained D4R and R3M models and our adapted D4R-Align and R3M-Align models.
  • Figure 3: t-SNE visualizations of the feature distributions on RLBench for the R3M and R3M-Align models. Each color denotes a task, and each point denotes a sample.
  • Figure S1: Examples of the 18 RLBench tasks (front view) with corresponding human instructions (sourced from ma2024contrastive).
  • Figure S2: Examples of the five real-world tasks, with each row presenting one instance of the corresponding task. For each demonstration, we provide visual observations at six different timestamps.