Synthetic vs. Real Training Data for Visual Navigation

Lauri Suomela, Sasanka Kuruppu Arachchige, German F. Torres, Harry Edelman, Joni-Kristian Kämäräinen

TL;DR

The paper proposes a navigation policy architecture that bridges the sim-to-real appearance gap by leveraging pretrained visual representations and runs in real time on robot hardware. It also identifies on-policy learning as a key advantage of simulated training over training with real data.

Abstract

This paper investigates how the performance of visual navigation policies trained in simulation compares to policies trained with real-world data. Performance degradation of simulator-trained policies is often significant when they are evaluated in the real world. However, despite this well-known sim-to-real gap, we demonstrate that simulator-trained policies can match the performance of their real-world-trained counterparts. Central to our approach is a navigation policy architecture that bridges the sim-to-real appearance gap by leveraging pretrained visual representations and runs in real time on robot hardware. Evaluations on a wheeled mobile robot show that the proposed policy, when trained in simulation, outperforms its real-world-trained version by 31 points and the prior state-of-the-art methods by 50 points in navigation success rate. Policy generalization is verified by deploying the same model onboard a drone. Our results highlight the importance of diverse image encoder pretraining for sim-to-real generalization, and identify on-policy learning as a key advantage of simulated training over training with real data. Code, model checkpoints, and multimedia materials are available at https://lasuomela.github.io/faint/

Paper Structure

This paper contains 14 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: We investigate how simulation-trained navigation policies compare to ones trained with real-world data when deployed on a real robot.
  • Figure 2: Model architecture. FAINT implements the goal-reaching policy $\mathbf{a}_t = \pi_{g} (\mathbf{O}_t, S_{t})$. Observation and subgoal images are encoded with a frozen PVR, and a binocular encoder refines the goal tokens by conditioning on the latest observation. A sequence encoder with a predictor head then produces the actions $\mathbf{a}_t$. Subgoals $S_t$ are obtained from a separate subgoal selection policy $\pi_{s}$.
  • Figure 3: Implicit correspondences of the six highest attention values in the binocular encoder's first cross-attention layer.
  • Figure 4: Training data collected from the simulator: oracle actions $\mathbf{a}_{gt}$ that control the agent, agent observation $O_{t}$, and subgoal image $S_t$.
  • Figure 5: Example segments from different types of test routes.
  • ...and 2 more figures
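
The data flow described in the Figure 2 caption, $\mathbf{a}_t = \pi_{g} (\mathbf{O}_t, S_{t})$, can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the class name `GoalReachingPolicy`, the four component callables, the list-based token representation, and the `toy_*` stand-ins are all hypothetical. In particular, how the sequence encoder combines observation tokens with the refined goal tokens is an assumption here (they are simply concatenated into one sequence).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-in types: an image is a 2D list of floats,
# and "tokens" are a list of feature vectors.
Image = List[List[float]]
Tokens = List[List[float]]


@dataclass
class GoalReachingPolicy:
    """Sketch of pi_g: (observations O_t, subgoal S_t) -> actions a_t."""
    encode_image: Callable[[Image], Tokens]           # frozen PVR (never trained here)
    refine_goal: Callable[[Tokens, Tokens], Tokens]   # binocular encoder
    encode_sequence: Callable[[List[Tokens]], List[float]]  # sequence encoder
    predict_actions: Callable[[List[float]], List[float]]   # predictor head

    def __call__(self, observations: Sequence[Image], subgoal: Image) -> List[float]:
        # Observation and subgoal images go through the same frozen encoder.
        obs_tokens = [self.encode_image(o) for o in observations]
        goal_tokens = self.encode_image(subgoal)
        # Goal tokens are refined by conditioning on the latest observation.
        refined_goal = self.refine_goal(goal_tokens, obs_tokens[-1])
        # Assumption: observation tokens and refined goal form one sequence.
        features = self.encode_sequence(obs_tokens + [refined_goal])
        return self.predict_actions(features)


# Toy components so the sketch runs end to end; they carry no learned weights.
def toy_pvr(image: Image) -> Tokens:
    # One single-valued "token" per image row: the row mean.
    return [[sum(row) / len(row)] for row in image]


def toy_binocular(goal: Tokens, obs: Tokens) -> Tokens:
    # Token-wise difference between goal and latest observation.
    return [[g[0] - o[0]] for g, o in zip(goal, obs)]


def toy_sequence(seq: List[Tokens]) -> List[float]:
    # Mean-pool each token set to a single scalar feature.
    return [sum(t[0] for t in toks) / len(toks) for toks in seq]


def toy_head(features: List[float]) -> List[float]:
    # Two placeholder action values derived from the features.
    return [features[-1], sum(features) / len(features)]


policy = GoalReachingPolicy(toy_pvr, toy_binocular, toy_sequence, toy_head)
observations = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
subgoal = [[3.0, 3.0], [3.0, 3.0]]
action = policy(observations, subgoal)
print(action)
```

In this sketch the subgoal image is passed in directly; in the paper it comes from a separate subgoal selection policy $\pi_{s}$, which is not modeled here.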