TrustNavGPT: Modeling Uncertainty to Improve Trustworthiness of Audio-Guided LLM-Based Robot Navigation

Xingpeng Sun, Yiran Zhang, Xindi Tang, Amrit Singh Bedi, Aniket Bera

TL;DR

TrustNavGPT tackles the problem of navigating via human audio instructions when speaker uncertainty can mislead LLM-based planners. It fuses textual transcription with vocal affective cues through an audio-processing module and a vocal-cue model, feeding a combined prompt $\mathcal{P}(\mathcal{V})=W\oplus K$ into an LLM that produces a plan $S$, which is then translated into executable actions by a tool library; when ambiguity arises, a decision-making engine uses visual context to refine choices. A probabilistic confidence score $\mathcal{C}(\rho)$ based on KL divergence to a ground-truth distribution, along with MCQA-style planning and a perception-driven action module, yields robust navigation under disfluent and uncertain human guidance. Empirical results on the DNIA dataset, RoboTHOR, and real-world tests show substantial improvements in success rate, proximity to targets, and efficiency, and demonstrate resilience to adversarial token attacks. The work advances safe human-robot interaction by enabling robots to reason about not only what humans say, but how they say it, to reduce misguidance in audio-directed navigation.
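
The prompt fusion $\mathcal{P}(\mathcal{V})=W\oplus K$ and the KL-based confidence score mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the cue names (`pitch`, `loudness`, `duration`), the prompt layout, the direction of the KL divergence, and the `exp(-KL)` mapping to a confidence value are all assumptions made for illustration, since this summary does not give the exact form of $\mathcal{C}(\rho)$.

```python
import math


def fuse_prompt(transcript, vocal_cues):
    """Build W ⊕ K: the transcript W followed by a textual rendering of the cues K.

    The prompt layout is hypothetical; the summary only states that the two
    parts are combined into one prompt.
    """
    cue_text = ", ".join(f"{name}={level}" for name, level in vocal_cues.items())
    return f"Instruction: {transcript}\nVocal cues: {cue_text}"


def kl_divergence(reference, predicted, eps=1e-12):
    """KL(reference || predicted) for two discrete distributions of equal length."""
    if len(reference) != len(predicted):
        raise ValueError("distributions must have the same length")
    return sum(
        r * math.log(r / max(q, eps))
        for r, q in zip(reference, predicted)
        if r > 0  # terms with r == 0 contribute nothing
    )


def confidence(reference, predicted):
    """Map the divergence to (0, 1]. Assumed mapping, not the paper's definition."""
    return math.exp(-kl_divergence(reference, predicted))


if __name__ == "__main__":
    # Cue names and levels are illustrative placeholders.
    prompt = fuse_prompt(
        "uh, go left... I think",
        {"pitch": "rising", "loudness": "low", "duration": "long"},
    )
    print(prompt)
    print(confidence([1.0, 0.0], [0.5, 0.5]))
```

Under this assumed mapping, a predicted distribution that matches the reference gives a confidence of 1, and the value decays toward 0 as the two distributions diverge.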

Abstract

While LLMs are proficient at processing text in human conversations, they often struggle with the nuances of verbal instructions and thus remain prone to hallucinating trust in human commands. In this work, we present TrustNavGPT, an LLM-based, audio-guided navigation agent that uses affective cues in spoken communication (elements such as tone and inflection that convey meaning beyond words) to assess the trustworthiness of human commands and make effective, safe decisions. Our approach is lightweight yet effective: it extends existing LLMs to model the vocal features embedded in a voice command and to model uncertainty for safe robotic navigation.

Paper Structure (17 sections, 8 equations, 4 figures, 4 tables, 1 algorithm)

This paper contains 17 sections, 8 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Current navigation methods based on Large Language Models (LLMs) struggle to make accurate decisions when faced with ambiguous audio instructions. Our strategy incorporates affective cues from spoken communication into LLMs, enabling them to evaluate the reliability of human instructions from both semantic and vocal uncertainty, thus allowing for safe and successful navigation.
  • Figure 2: Overview: Human audio goes through an audio-processing module that transcribes it, while a vocal cue model identifies three essential affective cues. We then prompt a language model to generate five possible next-step actions and select one based on the next-token logit probability. Notably, semantic transcription alone leads to the red choice, whereas incorporating the vocal cues results in the green choice being selected. Finally, a tool library translates the chosen language instruction into agent actions for navigation.
  • Figure 3: Illustration of action sequences. The purple box shows the reference object. At the point where the human instruction is ambiguous, the robot sees a television on the right-hand side (pink box), reasons that the television is likely near the remote control, and moves to the right instead of following the human instruction. Without uncertainty analysis, the LLM navigation path (shown in red) leads in the wrong direction. The navigation result of our method (shown in green) arrives at the target (yellow box) successfully.
  • Figure 4: Real-world navigation with vocal directions to a Starbucks coffee shop. The robot successfully arrived at the target.
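
The selection step described in the Figure 2 caption, choosing among five candidate actions by next-token logit probability, can be sketched as follows. This is an illustrative sketch only: the option labels `A` to `E`, the function names, and the logit values are hypothetical, and in a real system the logits would come from the language model's output for each option token.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_option(option_logits):
    """Return the most probable option label and the full probability table.

    `option_logits` maps each multiple-choice label to the logit the model
    assigns to that label as the next token (values here are made up).
    """
    labels = list(option_logits)
    probs = softmax([option_logits[label] for label in labels])
    best_index = max(range(len(labels)), key=probs.__getitem__)
    return labels[best_index], dict(zip(labels, probs))


if __name__ == "__main__":
    # Hypothetical logits for five candidate next-step actions.
    logits = {"A": 1.0, "B": 3.0, "C": 0.5, "D": 0.0, "E": -1.0}
    choice, table = select_option(logits)
    print(choice)
    print(table)
```

Because the probabilities are normalized over the five options only, the table can also be read as a distribution over candidate actions, which is the kind of quantity a divergence-based confidence score would operate on.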