Visual Hindsight Self-Imitation Learning for Interactive Navigation
Kibeom Kim, Kisung Shin, Min Whoo Lee, Moonhoen Lee, Minsu Lee, Byoung-Tak Zhang
TL;DR
This work addresses the poor sample efficiency of learning instruction-based interactive visual navigation under sparse rewards by introducing Visual Hindsight Self-Imitation Learning (VHS). VHS combines hindsight goal relabeling with self-imitation and a novel Prototypical Goal (PG) embedding that enables vision-based relabeling in partially observable environments. The agent is trained on an A3C backbone with goal-aware supervised contrastive (SupCon) learning. The approach yields state-of-the-art results on three tasks of escalating difficulty and demonstrates strong sample efficiency, with ablations and visualizations supporting the design of the PG embedding and the VHS mechanisms. The work advances practical embodied AI by reducing reliance on dense rewards and expert demonstrations, and it points to continuous-action settings and broader goal representations as future directions.
Abstract
Interactive visual navigation tasks, which involve following instructions to reach and interact with specific targets, are challenging not only because successful experiences are very rare but also because the complex visual inputs require a substantial number of samples. Previous methods for these tasks often rely on intricately designed dense rewards or on expensive expert data for imitation learning. To tackle these challenges, we propose a novel approach, Visual Hindsight Self-Imitation Learning (VHS), which enhances sample efficiency through hindsight goal relabeling and self-imitation. We also introduce a prototypical goal embedding method, derived from experienced goal observations, that is particularly effective in vision-based and partially observable environments. This embedding technique allows the agent to visually reinterpret its unsuccessful attempts, enabling vision-based goal relabeling and self-imitation from the resulting enlarged set of successful experiences. Experimental results show that VHS outperforms existing techniques in interactive visual navigation tasks in both final performance and sample efficiency.
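The relabeling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `PrototypeBank` class, the `relabel_failed_episode` function, the episode dictionary layout, the cosine-similarity matching rule, and the `threshold` value are all hypothetical choices made for illustration. The sketch assumes that each goal's prototype is the running mean of embeddings of goal observations the agent has experienced, and that a failed episode is relabeled when its final observation embedding is close to the prototype of a goal other than the instructed one.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class PrototypeBank:
    """Hypothetical helper: keeps a running-mean embedding (prototype) per goal."""
    sums: dict = field(default_factory=dict)
    counts: dict = field(default_factory=dict)

    def update(self, goal, embedding):
        # Accumulate the embedding of an experienced goal observation.
        if goal not in self.sums:
            self.sums[goal] = [0.0] * len(embedding)
            self.counts[goal] = 0
        self.sums[goal] = [s + e for s, e in zip(self.sums[goal], embedding)]
        self.counts[goal] += 1

    def prototype(self, goal):
        n = self.counts[goal]
        return [s / n for s in self.sums[goal]]

    def nearest(self, embedding):
        # Return the goal whose prototype is most similar, plus that similarity.
        best = max(self.sums, key=lambda g: cosine(self.prototype(g), embedding))
        return best, cosine(self.prototype(best), embedding)


def relabel_failed_episode(episode, bank, threshold=0.8):
    """Return a relabeled copy of a failed episode, or None if nothing matches.

    Assumption: the episode's final observation embedding stands in for
    "what the agent actually reached"; the threshold is illustrative.
    """
    if episode["success"] or not bank.sums:
        return None
    goal, sim = bank.nearest(episode["final_embedding"])
    if sim < threshold or goal == episode["goal"]:
        return None
    # Swap in the matched goal and its prototype so the trajectory
    # can be reused as a successful example for self-imitation.
    return {**episode, "goal": goal,
            "goal_embedding": bank.prototype(goal), "success": True}


# Usage with toy 2-D embeddings and made-up goal names.
bank = PrototypeBank()
bank.update("red_box", [1.0, 0.0])
bank.update("red_box", [0.8, 0.2])
bank.update("blue_ball", [0.0, 1.0])

failed = {"goal": "blue_ball", "final_embedding": [0.9, 0.1],
          "success": False, "steps": []}
relabeled = relabel_failed_episode(failed, bank)
```

In this toy run the agent was told to reach `blue_ball` but ended near something resembling `red_box`, so the episode is returned with `red_box` as its goal and marked successful. Matching against visual prototypes rather than a ground-truth state is what makes this kind of relabeling usable when the agent only has partial, image-based observations.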
