Bird's Eye View Based Pretrained World model for Visual Navigation

Kiran Lekkala, Chen Liu, Laurent Itti

TL;DR

We present a novel system that fuses the components of a traditional World Model into a robust system, trained entirely within a simulator, that transfers zero-shot to the real world.

Abstract

Sim2Real transfer has gained popularity because it enables transfer from inexpensive simulators to the real world. This paper presents a novel system that fuses the components of a traditional World Model into a robust system, trained entirely within a simulator, that transfers zero-shot to the real world. To facilitate transfer, we use an intermediate representation based on Bird's Eye View (BEV) images. Our robot learns to navigate in a simulator by first learning to translate complex First-Person View (FPV) RGB images into BEV representations, and then learning to navigate using those representations. When tested in the real world, the robot uses the perception model to translate FPV RGB images into the embeddings learned by the FPV-to-BEV translator, which the downstream policy can then use. Incorporating state-checking modules that use anchor images and a Mixture Density LSTM not only interpolates uncertain and missing observations but also enhances the robustness of the model in the real world. We trained the model using data from a differential-drive robot in the CARLA simulator. We show the effectiveness of our methodology by deploying the trained models on a real-world differential-drive robot. Lastly, we release a comprehensive codebase, dataset, and models for training and deployment (https://sites.google.com/view/value-explicit-pretraining).

Paper Structure (13 sections, 8 equations, 8 figures)

Figures (8)

  • Figure 1: Overview of our system. We first train the visual navigation system on a large-scale dataset collected in the simulator, then deploy the frozen model in an unseen real-world environment.
  • Figure 2: Training pipeline for the perception model. (a) During the training phase, the ResNet model is trained on a set of temporal sequences from the simulator, each consisting of pairs of inputs (FPV images, displacement and orientation to the goal) and outputs (BEV images). Our contrastive loss pulls the positive sample closer to the anchor and pushes the negative sample farther away. (b) At the bottom, we show the input and output used to train the memory module.
  • Figure 3: Working of the System. RGB observation $o_t$ at time step $t$ is passed to the perception model (blue) that compresses it into an embedding $z_t$. The memory model takes the current latent representation $z_t$ and uses the historical context to refine the state into $\hat{z}_t$. These embeddings could either be used to train a control policy (orange) or to reconstruct the Bird's Eye View (BEV) for planning (grey). Both utilities result in an action command $a_t$.
  • Figure 4: Robustness enhancement using the memory module. TSC (red) takes input from the representation $z_t$ only when it comes with a high confidence score. Otherwise, it uses the LSTM's previous prediction $\hat{z}_{t-1}$ as an interpolation. ASC (green) improves the representation of the incoming observation by bringing it in-domain. The crosses above correspond to rejecting the percepts and using the model's state prediction as the current state.
  • Figure 5: Out-of-domain and real-world evaluation. We constructed two 6-class validation datasets: one from the simulator (first row in the table) and another from street-view data (second row). Each class corresponds to one of the BEV images shown above. We report the accuracy for each class, along with the success rate (SR) of the agent when the encoder is deployed for real-world visual navigation. Our method outperformed the ResNet classifier (baseline) on the unseen simulation dataset, the real-world validation dataset, and real-world navigation, as shown above.
  • ...and 3 more figures
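
The TSC gating rule described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `tsc_gate` and `run_tsc`, the `threshold` value, the way the confidence score is supplied, and the stateless `memory` callable standing in for the Mixture Density LSTM are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def tsc_gate(
    z_t: Optional[Sequence[float]],
    confidence: float,
    z_hat_prev: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Tuple[Vector, bool]:
    """Choose the input to the memory model for the current step.

    Accept the fresh embedding z_t only when it is present and its
    confidence reaches the threshold; otherwise fall back to the previous
    prediction z_hat_prev (the "cross" case in Figure 4).
    """
    if z_t is not None and confidence >= threshold:
        return list(z_t), True
    return list(z_hat_prev), False


def run_tsc(
    observations: Sequence[Tuple[Optional[Sequence[float]], float]],
    z_hat_init: Sequence[float],
    memory: Callable[[Vector], Vector] = lambda x: list(x),
    threshold: float = 0.5,
) -> Tuple[List[Vector], List[bool]]:
    """Run the gate over a sequence of (embedding, confidence) pairs.

    `memory` is a stateless stand-in for the LSTM (identity by default);
    a real recurrent model would also carry a hidden state between steps.
    """
    z_hat_prev = list(z_hat_init)
    states: List[Vector] = []
    accepted: List[bool] = []
    for z_t, confidence in observations:
        gated, ok = tsc_gate(z_t, confidence, z_hat_prev, threshold)
        z_hat_prev = memory(gated)  # refined state for this step
        states.append(z_hat_prev)
        accepted.append(ok)
    return states, accepted


if __name__ == "__main__":
    # Step 2 has low confidence and step 3 is missing; both reuse the prediction.
    obs = [([1.0, 0.0], 0.9), ([5.0, 5.0], 0.1), (None, 0.0), ([0.0, 1.0], 0.8)]
    states, accepted = run_tsc(obs, z_hat_init=[0.0, 0.0])
    print(accepted)
    print(states)
```

With the identity stand-in, the low-confidence and missing observations simply repeat the last accepted embedding. A learned memory model would instead roll its own prediction forward, which is the interpolation behavior the caption describes.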