TrustNavGPT: Modeling Uncertainty to Improve Trustworthiness of Audio-Guided LLM-Based Robot Navigation
Xingpeng Sun, Yiran Zhang, Xindi Tang, Amrit Singh Bedi, Aniket Bera
TL;DR
TrustNavGPT tackles the problem of navigating via human audio instructions when speaker uncertainty can mislead LLM-based planners. It fuses textual transcription with vocal affective cues through an audio-processing module and a vocal-cue model, feeding a combined prompt $\mathcal{P}(\mathcal{V})=W\oplus K$ into an LLM that produces a plan $S$, which is then translated into executable actions by a tool library; when ambiguity arises, a decision-making engine uses visual context to refine choices. A probabilistic confidence score $\mathcal{C}(\rho)$ based on KL divergence to a ground-truth distribution, along with MCQA-style planning and a perception-driven action module, yields robust navigation under disfluent and uncertain human guidance. Empirical results on the DNIA dataset, RoboTHOR, and real-world tests show substantial improvements in success rate, proximity to targets, and efficiency, and demonstrate resilience to adversarial token attacks. The work advances safe human-robot interaction by enabling robots to reason about not only what humans say, but how they say it, to reduce misguidance in audio-directed navigation.
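As a rough illustration of the two quantities named above, the sketch below builds a fused prompt from a transcript $W$ and vocal cues $K$, and computes a KL-based confidence score. This summary does not give the exact definitions, so several choices here are assumptions made only for illustration: that $\oplus$ is plain text concatenation, that the divergence is taken as $\mathrm{KL}(\text{reference}\,\|\,\text{predicted})$, and that the divergence is mapped to a score in $(0, 1]$ with $\exp(-\mathrm{KL})$. The function names `fuse_prompt`, `kl_divergence`, and `confidence` are hypothetical and are not taken from the paper.

```python
import math


def fuse_prompt(transcript: str, vocal_cues: dict) -> str:
    """Concatenate transcript W and vocal cues K into one prompt (assumed form of W ⊕ K)."""
    cue_text = ", ".join(f"{name}={value}" for name, value in sorted(vocal_cues.items()))
    return f"Instruction: {transcript}\nVocal cues: {cue_text}"


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same options."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def confidence(predicted, reference):
    """Map KL(reference || predicted) to (0, 1]; the exp(-KL) mapping is an assumption."""
    return math.exp(-kl_divergence(reference, predicted))


if __name__ == "__main__":
    prompt = fuse_prompt("go to the kitchen", {"pitch": "rising", "pauses": "frequent"})
    print(prompt)
    # Predicted distribution over four MCQA-style options vs. a one-hot reference.
    print(confidence([0.7, 0.1, 0.1, 0.1], [1.0, 0.0, 0.0, 0.0]))
```

With a one-hot reference, this particular mapping reduces to the probability the planner assigns to the correct option, which makes the score easy to interpret; other directions of the divergence or other mappings would behave differently.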
Abstract
While LLMs are proficient at processing text in human conversations, they often struggle with the nuances of verbal instructions and thus remain prone to hallucinating trust in human commands. In this work, we present TrustNavGPT, an LLM-based, audio-guided navigation agent that uses affective cues in spoken communication (elements such as tone and inflection that convey meaning beyond words) to assess the trustworthiness of human commands and make effective, safe decisions. Our approach is a lightweight yet effective extension of existing LLMs that models the vocal features embedded in a voice command, and the uncertainty they signal, for safe robotic navigation.
