Improving Zero-Shot ObjectNav with Generative Communication

Vishnu Sashank Dorbala, Vishnu Dutt Sharma, Pratap Tokekar, Dinesh Manocha

TL;DR

This work identifies the novel linguistic trait of preemptive hallucination in the embodied setting, where the overhead agent assumes in the dialogue that the ground agent has already executed an action when it has yet to move, and notes its strong correlation with navigation performance.

Abstract

We propose a new method for improving zero-shot ObjectNav that aims to utilize potentially available environmental percepts for navigational assistance. Our approach takes into account that the ground agent may have a limited and sometimes obstructed view. Our formulation encourages Generative Communication (GC) between an assistive overhead agent with a global view containing the target object and the ground agent with an obfuscated view; both are equipped with Vision-Language Models (VLMs) for vision-to-language translation. In this assisted setup, the embodied agents communicate environmental information before the ground agent executes actions towards a target. Despite the overhead agent having a global view with the target, we note a drop in performance (-13% in OSR and -13% in SPL) for a fully cooperative assistance scheme relative to an unassisted baseline. In contrast, a selective assistance scheme where the ground agent retains its independent exploratory behaviour shows a 10% OSR and 7.65% SPL improvement. To explain navigation performance, we analyze the GC for unique traits, quantifying the presence of hallucination and cooperation. Specifically, we identify the novel linguistic trait of preemptive hallucination in our embodied setting, where the overhead agent assumes in the dialogue that the ground agent has already executed an action when it has yet to move, and note its strong correlation with navigation performance. We conduct real-world experiments and present some qualitative examples where we mitigate hallucinations via prompt finetuning to improve ObjectNav performance.
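
The abstract reports results in OSR and SPL without defining them here. As a rough sketch, assuming the commonly used definitions (OSR as the fraction of episodes in which the agent comes within the success radius of the target at any step, and SPL as success weighted by the ratio of shortest-path length to travelled path length), the two metrics could be computed as below. The `Episode` fields and function names are hypothetical, and the paper may use a variant of these definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are illustrative only."""
    success: bool          # agent ended the episode within the success radius
    oracle_success: bool   # agent was within the success radius at any step
    shortest_path: float   # shortest-path length from start to target
    path_taken: float      # length of the path the agent actually travelled


def oracle_success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent ever reached the target region."""
    return sum(e.oracle_success for e in episodes) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        denom = max(e.path_taken, e.shortest_path)
        # Failed episodes contribute zero; guard against a zero-length path.
        if e.success and denom > 0:
            total += e.shortest_path / denom
    return total / len(episodes)


episodes = [
    Episode(success=True, oracle_success=True, shortest_path=5.0, path_taken=10.0),
    Episode(success=False, oracle_success=True, shortest_path=4.0, path_taken=8.0),
    Episode(success=False, oracle_success=False, shortest_path=3.0, path_taken=6.0),
    Episode(success=True, oracle_success=True, shortest_path=2.0, path_taken=2.0),
]
print(oracle_success_rate(episodes), spl(episodes))
```

Under these definitions an agent can raise OSR by passing near the target without stopping there, while SPL also penalizes detours, which is why the two numbers need not move together.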

Paper Structure (33 sections, 2 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Overview: We tackle zero-shot ObjectNav in an assisted setup, where the ground agent aims to improve performance by seeking assistance from other available environmental percepts. We consider an overhead agent (as shown) with a clear view of the target and a ground agent with an obstructed view of the target that convey environmental information to each other via freeform, unconstrained Generative Communication (GC). We use GC to develop two novel assisted navigation schemes and present results in both simulated and real-world environments, inferring that GC is useful only in a selective setup where the ground agent retains its independent exploration capability.
  • Figure 2: Approach: We consider 3 different setups for assisted ObjectNav on a ground agent (GA) using an overhead agent (OA). The No Comm. case (brown arrows, illustrated on the left) is a baseline ObjectNav setup where a VLM is prompted directly for the GA's navigation actions. For the remaining two cases, both agents first go through a Comm. phase ($\mathcal{C}$) for a fixed number of interactions $C_{\text{len}}$. We then summarize the dialogues for decision-making. In the Cooperative Action case (blue arrows), we pass the Generative Communication (GC) to an LLM that predicts an action for the GA. In the Selective Execution case (green arrows), the GA's VLM is prompted with the suggested action and asked if it wants to cooperate with the LLM prediction. If not, it performs independent exploration as in the No Comm. case. We later analyze the generated dialogue to measure generative communication traits.
  • Figure 3: Dialogue Hallucinations $\mathcal{H}$: We study hallucinations in dialogues to explain assisted ObjectNav performance. In the left figure, the target object is a PepperShaker, showing examples of hallucinations at different communication lengths. The right figure compares hallucinations ($\mathcal{H}_{PE}$, $\mathcal{H}_{GO}$), Cooperation Rate $\mathcal{CR}$ and Dialogue Similarity $\mathcal{DS}$ between cooperative and selective actions. Among all the traits, notice that $\mathcal{H}_{PE}$ stands out, getting worse as communication length increases, despite explicitly prompting for concise information.
  • Figure 4: Real World Experiments: We carry out a real-world experiment with a Turtlebot as a Ground Agent (GA) and a GoPro camera mounted to the roof as an Overhead Agent (OA) in various environment settings. Note the incorrect action taken in the cooperative execution case (red arrows) in comparison to the selective case (green arrows). The actions predicted are in yellow. Section \ref{sec:quant_rw} below discusses the various hallucinations we encounter in different environment settings and how we finetune VLM prompts for better results.
  • Figure 5: Top-View Localization: Notice the poor localization of GPT-4V on an overhead image when asked to identify the location of the robot. Adding an orange marker on the robot helps alleviate this issue.
  • ...and 9 more figures
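
The three setups described in the Figure 2 caption can be sketched as a single decision function. This is a minimal illustration and not the authors' implementation: every callable (`ga_explore`, `ga_speak`, `oa_speak`, `llm_predict`, `ga_accepts`) is a hypothetical stand-in for a VLM or LLM call, the action strings are placeholders, the turn order within the Comm. phase is an assumption, and dialogue summarization is assumed to happen inside `llm_predict`.

```python
from typing import Callable, List

Action = str
Dialogue = List[str]


def communicate(ga_speak: Callable[[Dialogue], str],
                oa_speak: Callable[[Dialogue], str],
                c_len: int) -> Dialogue:
    """Comm. phase: c_len rounds of GA/OA exchange (turn order is assumed)."""
    dialogue: Dialogue = []
    for _ in range(c_len):
        dialogue.append("GA: " + ga_speak(dialogue))
        dialogue.append("OA: " + oa_speak(dialogue))
    return dialogue


def choose_action(scheme: str,
                  ga_explore: Callable[[], Action],
                  ga_speak: Callable[[Dialogue], str],
                  oa_speak: Callable[[Dialogue], str],
                  llm_predict: Callable[[Dialogue], Action],
                  ga_accepts: Callable[[Action], bool],
                  c_len: int = 3) -> Action:
    """Pick the GA's next action under one of the three schemes."""
    if scheme == "no_comm":
        # Baseline: the GA explores on its own, with no dialogue at all.
        return ga_explore()
    dialogue = communicate(ga_speak, oa_speak, c_len)
    suggested = llm_predict(dialogue)
    if scheme == "cooperative":
        # The GA always executes the action predicted from the dialogue.
        return suggested
    if scheme == "selective":
        # The GA may decline the suggestion and fall back to exploration.
        return suggested if ga_accepts(suggested) else ga_explore()
    raise ValueError(f"unknown scheme: {scheme}")


# Stub callables standing in for model calls.
stubs = dict(
    ga_explore=lambda: "RotateLeft",
    ga_speak=lambda d: "I see a table.",
    oa_speak=lambda d: "The target is ahead of you.",
    llm_predict=lambda d: "MoveAhead",
)
print(choose_action("selective", ga_accepts=lambda a: False, **stubs))
```

The only difference between the cooperative and selective branches is the `ga_accepts` check, which is what lets the ground agent keep its independent exploration when it does not trust the suggested action.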