Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning

Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, Ali Farhadi

TL;DR

The paper tackles the generalization and data-efficiency challenges of deep reinforcement learning for visual navigation by introducing a target-driven policy that conditions on the goal image. It employs a deep siamese actor-critic architecture trained with an A3C-like, asynchronous protocol and a rich AI2-THOR simulation environment to enable scalable, realistic training and cross-scene/target transfer. The approach achieves faster convergence than state-of-the-art DRL methods, generalizes to unseen targets and scenes, and transfers to real robots with minimal fine-tuning, demonstrating practical applicability. The AI2-THOR framework and end-to-end, map-free navigation without explicit feature matching or 3D reconstruction mark significant steps toward deployable, vision-based robotic navigation.

Abstract

Two less-addressed issues of deep reinforcement learning are (1) a lack of generalization capability to new target goals, and (2) data inefficiency, i.e., the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to apply to real-world scenarios. In this paper, we address these two issues and apply our model to the task of target-driven visual navigation. To address the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows it to generalize better. To address the second issue, we propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine. Our framework enables agents to take actions and interact with objects. Hence, we can collect a huge number of training samples efficiently. We show that our proposed method (1) converges faster than the state-of-the-art deep reinforcement learning methods, (2) generalizes across targets and across scenes, (3) generalizes to a real robot scenario with a small amount of fine-tuning (although the model is trained in simulation), and (4) is end-to-end trainable and does not need feature engineering, feature matching between frames, or 3D reconstruction of the environment. The supplementary video can be accessed at the following link: https://youtu.be/SmBxMDiOrvs.
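
The first contribution can be stated compactly in symbols. The notation here is an assumption chosen for illustration ($s_t$ for the current observation, $g$ for the goal image, $a_t$ for the action, $\theta$ for the network parameters) and may differ from the notation used in the paper's own equation. A conventional policy conditions only on the state, so a new goal requires a new policy; the target-driven policy takes the goal as an additional input, so one set of parameters can serve many goals:

```latex
\underbrace{a_t \sim \pi_\theta(a_t \mid s_t)}_{\text{one policy per goal}}
\quad\longrightarrow\quad
\underbrace{a_t \sim \pi_\theta(a_t \mid s_t,\, g)}_{\text{one policy for all goals}}
```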

Paper Structure

This paper contains 21 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: The goal of our deep reinforcement learning model is to navigate towards a visual target with a minimum number of steps. Our model takes the current observation and the image of the target as input and generates an action in the 3D environment as the output. Our model learns to navigate to different targets in a scene without re-training.
  • Figure 2: Screenshots of our framework and other simulated learning frameworks: ALE, ViZDoom, UETorch, Project Malmo, SceneNet, TORCS, SYNTHIA, and Virtual KITTI.
  • Figure 3: Our framework provides a rich interaction platform for AI agents. It enables physical interactions, such as pushing or moving objects (the first row), as well as object interactions, such as changing the state of objects (the second row).
  • Figure 4: Network architecture of our deep siamese actor-critic model. The numbers in parentheses show the output dimensions. Layer parameters in the green squares are shared. The ResNet-50 layers (yellow) are pre-trained on ImageNet and fixed during training.
  • Figure 5: Data efficiency of training. Our model learns better navigation policies compared to the state-of-the-art A3C methods (Mnih et al., 2016) after 100M training frames.
  • ...and 4 more figures
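
The architecture described in the Figure 4 caption (two branches with shared parameters, operating on features from a frozen ResNet-50) can be sketched roughly as below. This is a minimal NumPy illustration, not the authors' implementation: the layer sizes (`FEAT_DIM`, `EMBED_DIM`), the action count (`NUM_ACTIONS`), the single fusion layer, and the random inputs standing in for pre-computed ResNet-50 features are all assumptions made for the example. It shows only the forward pass; training with the asynchronous actor-critic update is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 2048   # assumed size of a frozen ResNet-50 feature vector
EMBED_DIM = 512   # hypothetical embedding size
NUM_ACTIONS = 4   # hypothetical number of discrete navigation actions


def init_layer(n_in, n_out):
    """Return (weights, bias) for a dense layer with small random weights."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)


# One embedding layer whose weights are used by BOTH branches (the siamese part).
W_embed, b_embed = init_layer(FEAT_DIM, EMBED_DIM)
# Fusion layer over the concatenated observation and target embeddings.
W_fuse, b_fuse = init_layer(2 * EMBED_DIM, EMBED_DIM)
# Actor (policy) and critic (value) heads.
W_pi, b_pi = init_layer(EMBED_DIM, NUM_ACTIONS)
W_v, b_v = init_layer(EMBED_DIM, 1)


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def forward(obs_feat, target_feat):
    """Map (observation features, target features) to (action probs, value)."""
    obs_emb = relu(obs_feat @ W_embed + b_embed)     # shared weights
    tgt_emb = relu(target_feat @ W_embed + b_embed)  # same shared weights
    fused = relu(np.concatenate([obs_emb, tgt_emb]) @ W_fuse + b_fuse)
    policy = softmax(fused @ W_pi + b_pi)
    value = float((fused @ W_v + b_v)[0])
    return policy, value


# Random stand-ins for frozen ResNet-50 features of one observation and two targets.
obs = rng.normal(size=FEAT_DIM)
goal_a = rng.normal(size=FEAT_DIM)
goal_b = rng.normal(size=FEAT_DIM)

policy_a, value_a = forward(obs, goal_a)
policy_b, value_b = forward(obs, goal_b)
```

Because the target enters as an input rather than being baked into the weights, the same observation paired with a different target yields a different action distribution without any retraining, which is the property the Figure 1 caption highlights.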