Natural Language Can Help Bridge the Sim2Real Gap

Albert Yu, Adeline Foote, Raymond Mooney, Roberto Martín-Martín

TL;DR

The paper addresses the challenge of learning image-conditioned robotic policies from limited real-world data by bridging the sim2real gap with language-guided cross-domain representations. It introduces Lang4Sim2Real, which pretrains an image encoder to align sim and real observations via language annotations, using one of two pretraining variants: Language-Regression or Language-Distance Learning. A frozen backbone with FiLM adapters enables multitask, multidomain imitation learning on abundant sim data together with scarce real demonstrations. The approach outperforms prior sim2real methods and vision-language pretraining baselines such as CLIP and R3M by 25 to 40%, and is demonstrated across three task suites, including long-horizon and deformable-object tasks.

Abstract

The main challenge in learning image-conditioned robotic policies is acquiring a visual representation conducive to low-level control. Due to the high dimensionality of the image space, learning a good visual representation requires a considerable amount of visual data. However, when learning in the real world, data is expensive. Sim2Real is a promising paradigm for overcoming data scarcity in the real-world target domain by using a simulator to collect large amounts of cheap data closely related to the target task. However, it is difficult to transfer an image-conditioned policy from sim to real when the domains are very visually dissimilar. To bridge the sim2real visual gap, we propose using natural language descriptions of images as a unifying signal across domains that captures the underlying task-relevant semantics. Our key insight is that if two image observations from different domains are labeled with similar language, the policy should predict similar action distributions for both images. We demonstrate that training the image encoder to predict the language description or the distance between descriptions of a sim or real image serves as a useful, data-efficient pretraining step that helps learn a domain-invariant image representation. We can then use this image encoder as the backbone of an IL policy trained simultaneously on a large amount of simulated and a handful of real demonstrations. Our approach outperforms widely used prior sim2real methods and strong vision-language pretraining baselines like CLIP and R3M by 25 to 40%. See additional videos and materials at https://robin-lab.cs.utexas.edu/lang4sim2real/.
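The two pretraining signals named in the abstract (predicting the language description, or predicting the distance between descriptions) can be sketched as loss functions over embedding vectors. This is a minimal illustration only, not the authors' implementation: the function names are hypothetical, and the use of mean squared error for the regression variant and cosine distance between language embeddings as the target for the distance variant are assumptions made here for concreteness.

```python
import math

def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def language_regression_loss(img_emb, lang_emb):
    # Variant (A), sketched: push the image embedding toward the embedding
    # of its language description. MSE is an assumption of this sketch.
    return mse(img_emb, lang_emb)

def language_distance_loss(img_a, img_b, lang_a, lang_b):
    # Variant (B), sketched: make the distance between two image embeddings
    # match the distance between their descriptions. Cosine distance as the
    # language-side target is an assumption of this sketch.
    target = cosine_distance(lang_a, lang_b)
    return (l2_distance(img_a, img_b) - target) ** 2

# A sim image and a real image that share a description should end up close:
# identical image embeddings with identical language give zero loss, while
# identical image embeddings with unrelated language are penalized.
same_lang = language_distance_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
diff_lang = language_distance_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In both variants the language embeddings act as fixed targets, so the image encoder is the only component being shaped, which is what lets sim and real images with similar descriptions land near each other.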

Paper Structure (48 sections, 3 equations, 6 figures, 10 tables, 4 algorithms)

Figures (6)

  • Figure 1: Bridging the sim2real gap with language. Robot images from simulation and the real world with similar language descriptions (green & purple borders) are mapped to similar features in language embedding space, while sim and real images with different language descriptions (teal & red) are mapped to faraway locations. We propose using language embedding similarities to re-shape the image embeddings (center) to create a domain-invariant image space. A policy is learned conditioned on these image embeddings from both sim and real images (right).
  • Figure 2: Method. (i) Top: During Image-Language Pretraining, we train the image encoder $f_{cnn}$ using the language embeddings associated with descriptions of both sim and real image observations. $f^{d}_{img}$ and $f^{d}_{lang}$ refer to the output features of the CNN and the LLM, respectively, in domain $d$. With regression-based loss (A) the image embeddings are pushed to predict the corresponding language embeddings whereas with distance-based loss (B) the pair of image embeddings is pushed together/apart based on the similarity of the language embeddings. (ii) Bottom: During Multitask, Multidomain BC, we freeze our pretrained $f_{cnn}$, add adapter modules and a policy head and allow the last layer of the CNN to finetune, then train the resulting multitask language-conditioned policy on $\mathcal{D}^{s} \cup \mathcal{D}^{t}_{target}$.
  • Figure 3: The columns depict the three task suites while each row represents an image domain. Rows from Top to Bottom: Simulation $\mathcal{D}^{s}$, sim2sim $\mathcal{D}^{t}_{target}$, sim2real $\mathcal{D}^{t}_{target}$. Columns from Left to Right: Stack Object, Multi-step Pick and Place, and Wrap Wire tasks. While similar enough to transfer prior knowledge between them, our $\mathcal{D}^{s}$ and $\mathcal{D}^{t}$ task versions have a considerable gap (Sec. \ref{sec:exp:sim2sim-sim2real-diff}) that we are able to bridge using language as regularization for the image representations.
  • Figure 4: These plots show the action distribution of demonstrations across both sim and real, broken down by each component of the action: $xy$-action magnitude, $z$-axis actions, and gripper actions. The first row shows simulation (green) and real world (blue) action distributions for images described by similar language. The second row shows the same distribution of simulation actions (green) as in the first row, but compared with real-world action distributions from images labeled with very different language from the sim actions (blue). Notably, the action distributions are generally similar for images with similar language (first row), and different for images with different language (second row). This suggests that pretraining our CNN on language embedding prediction benefits downstream policy learning because it allows the domain-invariant learned representations to tap into similar action distributions for completing a task.
  • Figure 5: This table builds on Figure \ref{fig:tasks} and depicts the 3 datasets for each task with filmstrips. The rows show the three task suites while each column represents one of the three datasets we use during pretraining or policy learning. Our main results in Tables \ref{tab:real} and \ref{tab:sim} use $\mathcal{D}^{s} \cup \mathcal{D}^{t}_{target}$ for pretraining and policy learning, whereas our results in Table \ref{tab:real-w-prior} use $\mathcal{D}^{s} \cup \mathcal{D}^{t}_{prior}$ for pretraining and $\mathcal{D}^{s} \cup \mathcal{D}^{t}_{target}$ for policy learning. This table shows the visual differences between sim and real, as well as the task in $\mathcal{D}^{t}_{prior}$ versus $\mathcal{D}^{t}_{target}$.
  • ...and 1 more figure
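The adapter modules added on top of the frozen encoder (Figure 2, bottom) are described in the TL;DR as FiLM adapters. FiLM (feature-wise linear modulation) scales and shifts each feature channel using parameters predicted from a conditioning vector. The sketch below shows only that generic operation; the function names, the single linear layer that predicts the scale and shift, and the choice of conditioning input (for example, a task-language embedding) are assumptions of this illustration rather than details confirmed by the text above.

```python
def linear(weights, bias, x):
    """Apply a dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * feature + beta, per channel."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

def film_adapter(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    # Hypothetical adapter: predict per-channel scale (gamma) and shift (beta)
    # from the conditioning vector, then modulate the backbone features.
    # Only these small weight matrices would be trained; the backbone that
    # produced `features` is assumed frozen.
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return film(features, gamma, beta)

# Two backbone channels, modulated directly with a given scale and shift.
modulated = film([1.0, 2.0], [2.0, 0.5], [0.0, 1.0])

# With gamma = 1 and beta = 0 the adapter leaves the features unchanged,
# a common starting point so that training begins from the frozen features.
identity = film_adapter(
    [3.0, 4.0], [1.0],
    w_gamma=[[1.0], [1.0]], b_gamma=[0.0, 0.0],
    w_beta=[[0.0], [0.0]], b_beta=[0.0, 0.0],
)
```

Because the modulation touches only a scale and a shift per channel, the number of trainable parameters stays small relative to the backbone, which suits a setting with only a handful of real demonstrations.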