Visual Hindsight Self-Imitation Learning for Interactive Navigation

Kibeom Kim, Kisung Shin, Min Whoo Lee, Moonhoen Lee, Minsu Lee, Byoung-Tak Zhang

TL;DR

This work tackles sample-inefficient, instruction-based interactive visual navigation under sparse rewards by introducing Visual Hindsight Self-Imitation Learning (VHS). VHS combines hindsight goal relabeling with self-imitation and a novel Prototypical Goal (PG) embedding to enable vision-based relabeling in partially observable environments, all trained via an A3C backbone with goal-aware SupCon learning. The approach yields state-of-the-art results on three tasks of escalating difficulty, demonstrates strong sample efficiency, and provides extensive ablations and visualizations to justify the PG embedding and VHS mechanisms. The work advances practical embodied AI by reducing reliance on dense rewards or expert demonstrations and highlights directions for continuous-action settings and broader goal representations.

Abstract

Interactive visual navigation tasks, which involve following instructions to reach and interact with specific targets, are challenging not only because successful experiences are very rare but also because the complex visual inputs require a substantial number of samples. Previous methods for these tasks often rely on intricately designed dense rewards or on expensive expert data for imitation learning. To tackle these challenges, we propose a novel approach, Visual Hindsight Self-Imitation Learning (VHS), which enhances sample efficiency through hindsight goal re-labeling and self-imitation. We also introduce a prototypical goal embedding method, derived from experienced goal observations, that is particularly effective in vision-based and partially observable environments. This embedding technique allows the agent to visually reinterpret its unsuccessful attempts, enabling vision-based goal re-labeling and self-imitation from enhanced successful experiences. Experimental results show that VHS outperforms existing techniques in interactive visual navigation tasks, confirming its superior performance and sample efficiency.
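
As a rough illustration of the hindsight goal re-labeling idea described in the abstract, the sketch below rewrites a failed episode so that its final observation becomes the goal. It is not the authors' implementation: the `Step` record, the `relabel_failed_episode` helper, the use of the raw final observation as the goal (the paper works with a goal embedding), and the choice to place a single success reward on the last step are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One transition of an episode (hypothetical record, not from the paper)."""
    observation: Tuple[float, ...]  # stand-in for an image or its features
    action: int
    goal: Tuple[float, ...]         # goal the agent was conditioned on
    reward: float


def relabel_failed_episode(
    episode: List[Step], success_reward: float = 1.0
) -> List[Step]:
    """Return a copy of a failed episode whose goal is its last observation.

    Assumption: rewards are sparse, so only the final step of the
    re-labeled episode receives `success_reward`.
    """
    if not episode:
        return []
    hindsight_goal = tuple(episode[-1].observation)
    # Every step now pursues the goal that was actually reached.
    relabeled = [replace(s, goal=hindsight_goal, reward=0.0) for s in episode]
    relabeled[-1] = replace(relabeled[-1], reward=success_reward)
    return relabeled
```

The re-labeled episode can then be treated as a successful trajectory for self-imitation, while the original episode is left unchanged.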

Paper Structure

This paper contains 34 sections, 5 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of the learning process for the proposed method. This approach employs prototypical goal embeddings to pursue goals, diverging from traditional word-embedding instructions. It introduces a strategy for re-labeling goals in failed episodes. The method facilitates learning through self-imitation and benefits from sparse reward settings in environments with visual inputs.
  • Figure 2: The overall architecture. Prototypical Goal (PG) embedding samples goal observations collected from goal storage, extracts features using the agent's feature extractor, and computes prototypical features. If the episode ends successfully, the goal observation is collected by pairing the end-of-episode observation with the desired goal. For each episode that fails, the re-labeling process replaces the goal with the last observation of the episode, and Visual Hindsight Self-Imitation Learning is then performed.
  • Figure 3: Learning curves for three visual navigation tasks. In all tasks, our method shows rapid improvement and saturation in performance, demonstrating high sample efficiency. In Task3 especially, it shows a significant gap with the baselines in task performance. The x-axis is the number of updates and the y-axis is the success rate. Each curve is produced from 7 trials, with bounds indicating mean $\pm$ standard deviation.
  • Figure 4: Learning curves of two ablation studies and an analysis of the proportion of reward types. Experiments are performed on Task2 (Interactive Object Navigation). The x-axis is the number of updates; the y-axis is the success rate in (a) and (b) and the reward proportion in (c).
  • Figure 5: Visualization of prototypical goals and embeddings of goal observations. The center of each region shows the prototypical goal embedding of the corresponding object, while the neighbouring images visualize data from goal storage with close feature distances and point outward to the source images.
  • ...and 4 more figures
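
The goal-storage and prototype computation described in the Figure 2 caption can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's code: `store_success`, `prototypical_goal`, the `extract` callable standing in for the agent's feature extractor, the `num_samples` parameter, and the use of a plain mean of the sampled features as the prototype are all hypothetical choices for illustration.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def store_success(
    goal_storage: Dict[str, List[Sequence[float]]],
    goal: str,
    final_observation: Sequence[float],
) -> None:
    """Pair the end-of-episode observation with the desired goal."""
    goal_storage.setdefault(goal, []).append(final_observation)


def prototypical_goal(
    goal_storage: Dict[str, List[Sequence[float]]],
    goal: str,
    extract: Callable[[Sequence[float]], Vector],
    num_samples: int = 4,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Sample stored goal observations and average their features.

    Assumption: the prototype is the element-wise mean of the features.
    """
    rng = rng or random.Random()
    stored = goal_storage.get(goal, [])
    if not stored:
        raise ValueError(f"no goal observations stored for {goal!r}")
    sampled = rng.sample(stored, min(num_samples, len(stored)))
    features = [extract(obs) for obs in sampled]
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]
```

For example, after two successful episodes are stored for the same goal, calling `prototypical_goal(storage, goal, extract, num_samples=2)` returns the mean of the two extracted feature vectors.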