Explore and Explain: Self-supervised Navigation and Recounting

Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

TL;DR

The paper tackles joint embodied navigation and natural-language recounting in unseen indoor environments by introducing $\mathsf{eX}^2$ (Explore and Explain). The model has three parts: a curiosity-driven, self-supervised navigation module built on a forward–inverse dynamics pair, with a penalty that promotes diverse exploration; a fully-attentive, Transformer-based captioning module that describes egocentric views; and a speaker policy that decides when to generate captions using object-, depth-, or curiosity-driven cues. Evaluations on Matterport3D environments, run through Habitat, report gains in navigation surprisal and in captioning coverage and diversity, with generalization to unseen scenes. The captions also make the agent's behavior interpretable, since they expose what it perceives while exploring.
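
The interplay between the curiosity signal and the speaker policy can be illustrated with a minimal sketch. It is not the authors' implementation: the half-squared-error form of the surprisal, the linear penalty on repeated actions, the `Observation` fields, the threshold values, and the direction of each comparison are all assumptions made for illustration, since this summary does not specify them.

```python
from dataclasses import dataclass
from typing import Sequence


def surprisal(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Forward-model error: half the squared L2 distance between the
    predicted and the observed next-state features (assumed form)."""
    return 0.5 * sum((p - o) ** 2 for p, o in zip(predicted, observed))


def intrinsic_reward(predicted: Sequence[float], observed: Sequence[float],
                     repeated_steps: int, penalty: float = 0.01) -> float:
    """Surprisal minus a hypothetical penalty that grows with the number of
    consecutive repeated actions, discouraging degenerate exploration."""
    return surprisal(predicted, observed) - penalty * repeated_steps


@dataclass
class Observation:
    num_objects: int   # objects detected in the current view (assumed detector output)
    mean_depth: float  # mean depth of the current view (assumed unit: metres)
    surprisal: float   # forward-model error at this step


def should_speak(obs: Observation, cue: str, threshold: float) -> bool:
    """Hypothetical speaker policy: trigger the captioner when the chosen
    cue reaches its threshold."""
    if cue == "object":
        return obs.num_objects >= threshold
    if cue == "depth":
        return obs.mean_depth > threshold
    if cue == "curiosity":
        return obs.surprisal > threshold
    raise ValueError(f"unknown cue: {cue}")


if __name__ == "__main__":
    obs = Observation(num_objects=3, mean_depth=1.2,
                      surprisal=surprisal([1.0, 2.0], [1.0, 4.0]))
    for cue, thr in [("object", 2), ("depth", 2.0), ("curiosity", 1.0)]:
        print(cue, should_speak(obs, cue, thr))
```

Keeping the three cues behind one function makes it easy to compare speaker strategies on the same trajectory; in practice the thresholds would have to be tuned on validation episodes.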

Abstract

Embodied AI has been recently gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes. Our model integrates a novel self-supervised exploration module with penalty, and a fully-attentive captioning model for explanation. Also, we investigate different policies for selecting proper moments for explanation, driven by information coming from both the environment and the navigation. Experiments are conducted on photorealistic environments from the Matterport3D dataset and investigate the navigation and explanation capabilities of the agent as well as the role of their interactions.

Paper Structure

This paper contains 14 sections, 13 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: We propose a novel setting in which an embodied agent performs joint curiosity-driven exploration and explanation in unseen environments. While navigating the environment, the agent must produce informative descriptions of what it sees, providing a means of interpreting its internal state.
  • Figure 2: Overview of our $\mathsf{eX}^2$ framework for navigation and captioning. Our model is composed of three main components: a navigation module that is in charge of exploring the environment, a captioning module that produces a textual sentence describing the agent's point of view, and a speaker policy that connects the previous modules and activates the captioning component based on the information collected during the navigation.
  • Figure 3: Qualitative results of the agent trajectories in sample navigation episodes.
  • Figure 4: Sentences generated on sample images extracted from $\mathsf{eX}^2$ navigation trajectories. For each image, we report the relevant objects present in the scene and underline their mentions in the caption.