Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning

Khanh Nguyen, Hal Daumé

TL;DR

HANNA introduces a photo-realistic navigation environment where agents can request multimodal assistance from simulated human assistants (ANNA). The authors propose a memory-augmented, hierarchical policy trained with retrospective, curiosity-encouraging imitation learning, so that the agent learns when and how to ask for help and how to interpret language-and-vision routes. Key contributions include the HANNA simulator, the I3L learning framework with reference and curiosity losses, and a hierarchical recurrent architecture that effectively leverages language and vision instructions. Empirical results show substantial improvements in task success, especially in unseen environments, validating the approach and highlighting the value of language-enabled guidance for robust navigation.

Abstract

Mobile agents that can leverage help from humans can potentially accomplish more complex tasks than they could entirely on their own. We develop "Help, Anna!" (HANNA), an interactive photo-realistic simulator in which an agent fulfills object-finding tasks by requesting and interpreting natural language-and-vision assistance. An agent solving tasks in a HANNA environment can leverage simulated human assistants, called ANNA (Automatic Natural Navigation Assistants), which, upon request, provide natural language and visual instructions to direct the agent towards the goals. To address the HANNA problem, we develop a memory-augmented neural agent that hierarchically models multiple levels of decision-making, and an imitation learning algorithm that teaches the agent to avoid repeating past mistakes while simultaneously predicting its own chances of making future progress. Empirically, our approach is able to ask for help more effectively than competitive baselines and, thus, attains higher task success rate on both previously seen and previously unseen environments. We publicly release code and data at https://github.com/khanhptnk/hanna . A video demo is available at https://youtu.be/18P94aaaLKg .

Paper Structure

This paper contains 39 sections, 4 theorems, 21 equations, 4 figures, 13 tables, 1 algorithm.

Key Result

Lemma 1

(proof in Appendix A) To guide the agent between any two locations using $O(\log N)$ instructions, we need to collect instructions for $\Theta(N \log N)$ location pairs.
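
To give a rough sense of the scale involved in Lemma 1, the sketch below compares the number of unordered location pairs among $N$ locations with $N \log_2 N$. It is only an arithmetic illustration, not the construction used in the proof: the hidden constant in $\Theta(N \log N)$ is assumed to be 1 and the logarithm base is assumed to be 2, and the function names are hypothetical.

```python
import math


def all_pairs(n: int) -> int:
    """Number of unordered location pairs among n locations."""
    return n * (n - 1) // 2


def n_log_n(n: int) -> int:
    """n * log2(n), rounded up.

    Stands in for Theta(N log N) with the hidden constant set to 1
    and base-2 logarithm (both are assumptions for illustration).
    """
    return math.ceil(n * math.log2(n))


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(f"N={n}: all pairs={all_pairs(n)}, N log2 N={n_log_n(n)}")
```

Under these assumptions, the gap between annotating every pair and annotating on the order of $N \log N$ pairs widens quickly as $N$ grows, which is the practical point of the lemma.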

Figures (4)

  • Figure 1: An example HANNA task. Initially, the agent stands in the bedroom at location A and is requested by a human requester to "find a mug." The agent begins, but gets lost somewhere in the bathroom. It gets to the start location of the first route (B) to request help from ANNA. Upon request, ANNA assigns the agent a navigation subtask described by a natural language instruction that guides the agent to a target location, and an image of the view at that location. The agent follows the language instruction and arrives at C, where it observes a match between the target image and the current view, and thus decides to depart the first route. After that, it resumes the main task of finding a mug. From this point, the agent gets lost one more time and has to query ANNA for another subtask that helps it follow the second route and enter the kitchen. The agent successfully fulfills the task if it finally stops within $\epsilon$ meters of an instance of the requested object (marked with a star). Here, the ANNA feedback is simulated using two pre-collected language-assisted routes (the first and second routes above).
  • Figure 2: Our hierarchical recurrent model architecture (the navigation network). The help-request network is mostly similar except that the navigation action distribution is fed as an input to compute the "state features".
  • Figure 3: Help-request behavior on TestUnseenAll: (a) fraction of time steps where the agent requests help and (b) predicted and true condition distributions. The already_asked condition is the negation of the never_asked condition.
  • Figure 4: Accuracy, precision, recall, and F-1 scores in predicting the help-request conditions on TestUnseenAll. The already_asked condition is the negation of the never_asked condition.

Theorems & Definitions (5)

  • Lemma 1
  • Lemma 2
  • Lemma 1
  • proof
  • Lemma 2