Embodied Agents for Efficient Exploration and Smart Scene Description

Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

TL;DR

This work tackles efficient exploration and human-understandable scene description by combining a navigator, a captioner, and a speaker policy in a single pipeline. The navigator uses a hierarchical policy and a neural occupancy map to explore and map unseen indoor spaces, while the captioner, a CLIP-guided Transformer, generates captions conditioned on the agent's visual input. The speaker policy decides when to describe a scene, using depth, object presence, or visual activations to avoid redundant narration. The authors introduce the Episode Description Score ($\mathsf{ED}\text{-}\mathsf{S}$) to jointly evaluate exploration and descriptive coverage, and report results on the Gibson and MP3D datasets, along with a real-world deployment on a LoCoBot platform. In their experiments, CLIP-based captioning and density-model-inspired exploration perform best, and the generated descriptions give human-readable insight into what the robot perceives as it navigates.
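
The gating role of the speaker policy can be illustrated with a minimal sketch. This is not the authors' implementation: the `Observation` fields, the threshold values, the direction of the depth test, and the rule that drops an immediately repeated caption are all assumptions made for illustration, and the captioner is passed in as an arbitrary callable rather than a CLIP-guided Transformer.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Observation:
    """Hypothetical per-step observation; the field names are assumptions."""
    depth: Sequence[float]   # flattened depth values, e.g. in metres
    objects: Sequence[str]   # labels produced by some object detector


def depth_policy(obs: Observation, min_mean_depth: float = 2.0) -> bool:
    """Speak when the mean depth is large enough (illustrative threshold)."""
    if not obs.depth:
        return False
    return sum(obs.depth) / len(obs.depth) >= min_mean_depth


def object_policy(obs: Observation, min_objects: int = 2) -> bool:
    """Speak when enough distinct objects are visible (illustrative threshold)."""
    return len(set(obs.objects)) >= min_objects


def describe_episode(
    observations: Sequence[Observation],
    should_speak: Callable[[Observation], bool],
    captioner: Callable[[Observation], str],
) -> List[str]:
    """Run the captioner only on steps where the speaker policy fires."""
    captions: List[str] = []
    for obs in observations:
        if not should_speak(obs):
            continue
        caption = captioner(obs)
        # Assumed redundancy filter: drop a caption identical to the previous one.
        if captions and caption == captions[-1]:
            continue
        captions.append(caption)
    return captions


def toy_captioner(obs: Observation) -> str:
    """Stand-in captioner that simply lists the detected objects."""
    return "a room with " + " and ".join(sorted(set(obs.objects)))


episode = [
    Observation(depth=[0.5, 0.5], objects=["wall"]),
    Observation(depth=[3.0, 4.0], objects=["sofa", "table"]),
    Observation(depth=[3.0, 3.0], objects=["sofa", "table"]),
]
captions = describe_episode(episode, depth_policy, toy_captioner)
```

Passing `object_policy` instead of `depth_policy` swaps the trigger criterion without touching the captioner, which mirrors the separation between deciding when to speak and deciding what to say.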

Abstract

The development of embodied agents that can communicate with humans in natural language has gained increasing interest in recent years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment while portraying interesting scenes with natural language descriptions. To this end, we propose and evaluate an approach that combines recent advances in visual robotic exploration and image captioning on images generated through agent-environment interaction. Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions. Further, such descriptions offer user-understandable insights into the robot's representation of the environment by highlighting the prominent objects and the correlations between them as encountered during the exploration. To quantitatively assess the performance of the proposed approach, we also devise a specific score that takes into account both exploration and description skills. The experiments carried out in both photorealistic simulated environments and real-world ones demonstrate that our approach can effectively describe the robot's point of view during exploration, improving the human-friendly interpretability of its observations.

Paper Structure (11 sections, 9 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overview of the proposed approach for smart scene description, comprising a navigator, a speaker policy, and a captioner module.
  • Figure 2: Qualitative exploration trajectories of different navigation agents on the same episode.
  • Figure 3: A sample of agent observation and corresponding images used by the speaker policy to trigger the captioner.
  • Figure 4: Sample observations and corresponding captions generated by our model.