UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation

Niyati Rawal; Roberto Bigazzi; Lorenzo Baraldi; Rita Cucchiara


TL;DR

This work tackles autonomous navigation with interactive natural-language dialogue by introducing UNMuTe, a two-component architecture that combines a GPT-2–based multimodal dialogue generator with a modified DUET navigator. Dialogue is generated when the navigator is uncertain: an entropy-based trigger with a learnable threshold prompts on-demand questions and answers that refer to current and future frames along the trajectory. The approach is trained in two stages and augmented with a policy that balances dialogue generation and navigation. It achieves state-of-the-art results on both the CVDN and NDH benchmarks while producing interpretable, human-readable dialogue samples. The results indicate that synthetic, target-driven dialogue can effectively guide navigation, with practical implications for human-in-the-loop robotics and multimodal AI systems.
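The entropy-based trigger can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `action_entropy` and `should_ask` are hypothetical, the entropy is plain Shannon entropy over the navigator's action probabilities, and the threshold is a fixed number here, whereas the summary describes it as learnable.

```python
import math


def action_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_ask(probs, threshold):
    """Return True when the action distribution is uncertain enough to ask.

    Assumption: `threshold` is a fixed value supplied by the caller; in the
    described system it would be learned rather than hand-set.
    """
    return action_entropy(probs) > threshold


# Illustrative distributions over four candidate actions (made-up numbers).
confident = [0.94, 0.02, 0.02, 0.02]   # one action clearly dominates
uncertain = [0.25, 0.25, 0.25, 0.25]   # all actions equally likely

print(should_ask(confident, threshold=1.0))
print(should_ask(uncertain, threshold=1.0))
```

A uniform distribution maximizes the entropy, so the trigger fires most readily when the navigator has no preferred action, and stays silent when one action dominates.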

Abstract

Smart autonomous agents are becoming increasingly important in various real-life applications, including robotics and autonomous vehicles. One crucial skill that these agents must possess is the ability to interact with their surrounding entities, such as other agents or humans. In this work, we aim to build an intelligent agent that can efficiently navigate in an environment while being able to interact with an oracle (or human) in natural language and ask for directions when it is unsure about its navigation performance. The interaction is started by the agent, which produces a question that is then answered by the oracle on the basis of the shortest trajectory to the goal. The process can be performed multiple times during navigation, thus enabling the agent to hold a dialogue with the oracle. To this end, we propose a novel computational model, named UNMuTe, that consists of two main components: a dialogue model and a navigator. Specifically, the dialogue model is based on a GPT-2 decoder that handles multimodal data consisting of both text and images. First, the dialogue model is trained to generate question-answer pairs: the question is generated using the current image, while the answer is produced leveraging future images on the path toward the goal. Subsequently, a VLN model is trained to follow the dialogue, predicting navigation actions or triggering the dialogue model if it needs help. In our experimental analysis, we show that UNMuTe achieves state-of-the-art performance on the main navigation tasks involving dialogue, i.e., Cooperative Vision and Dialogue Navigation (CVDN) and Navigation from Dialogue History (NDH), showing that our approach is effective in generating useful questions and answers to guide navigation.
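The interaction protocol in the abstract (navigate, ask when unsure, receive an answer, continue) can be sketched as a simple control loop. Everything below is a toy stand-in under stated assumptions: `run_episode`, `toy_policy`, `toy_ask`, and `toy_answer` are hypothetical names, the environment is a corridor of integer nodes, and the oracle simply names the goal rather than reading future images along the shortest path.

```python
def run_episode(policy, ask, answer, start, goal, max_steps=20):
    """Navigate from `start` to `goal`, asking the oracle when the policy is unsure."""
    history = []          # accumulated (question, answer) pairs
    position = start
    path = [start]
    for _ in range(max_steps):
        if position == goal:
            break
        next_node, unsure = policy(position, history)
        if unsure:
            # The agent starts the exchange; the oracle replies.
            question = ask(position)
            history.append((question, answer(position, goal)))
            # Re-decide now that the dialogue history has grown.
            next_node, _ = policy(position, history)
        position = next_node
        path.append(position)
    return path, history


def toy_policy(position, history):
    """Toy navigator: unsure until it has received at least one answer."""
    if not history:
        return position, True
    return position + 1, False


def toy_ask(position):
    return f"Where should I go from node {position}?"


def toy_answer(position, goal):
    # Stand-in for an oracle that would use the shortest trajectory to the goal.
    return f"Keep moving forward until you reach node {goal}."


path, history = run_episode(toy_policy, toy_ask, toy_answer, start=0, goal=3)
print(path)
print(history)
```

In this toy run the agent asks once at the start and then follows the guidance to the goal; a real navigator could re-trigger the exchange several times along the route, which is what turns isolated question-answer pairs into a dialogue.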

Paper Structure (16 sections, 2 equations, 5 figures, 8 tables)


Introduction
Related Work
Vision-and-Language Navigation
Vision-and-Dialogue Navigation
Text Generation for Visual Navigation
Proposed Method
Dialogue Model
Navigator Model
Dialogue Exchange during Navigation
Experiments
Experimental Setup
CVDN Experiments
NDH Task
Dialogue Generation
Qualitative Generation Samples
...and 1 more section

Figures (5)

Figure 1: We propose a novel computational model that learns to exchange dialogue during navigation when the agent is unsure of the action it should take in the environment. Our proposed model allows the agent to (a) decide when to ask a question, (b) ask target-driven questions, (c) answer given questions, and more importantly, (d) navigate toward the goal.
Figure 2: UNMuTe consists of a dialogue model that is based on a GPT-2 decoder and a navigation model that is based on a state-of-the-art navigator, i.e., DUET (Chen et al., 2022). When DUET is unsure of the action the agent should take, it outputs an action that prompts the dialogue model to generate a question and an answer regarding where the agent should move.
Figure 3: Dialogue model with corresponding inputs and outputs. The model is trained to predict the next language token in the sequence. To facilitate graphical presentation, special tokens such as BOS or EOS are omitted.
Figure 4: Probability distributions of the entropy of the action probability and the temporal distances between dialogues on the training split of CVDN.
Figure 5: Sample paths taken from the CVDN "val unseen" split, together with the corresponding ground-truth interactions and generated ones. The number of depicted steps has been artificially reduced to $6$ to facilitate the graphical presentation. We only show the frontal image of the panoramic observation at each timestep.
