Beyond Text: Utilizing Vocal Cues to Improve Decision Making in LLMs for Robot Navigation Tasks

Xingpeng Sun; Haoming Meng; Souradip Chakraborty; Amrit Singh Bedi; Aniket Bera

TL;DR

Beyond Text improves LLM decision-making by combining audio transcription with a subset of paralinguistic features that focus on affect and are more relevant to human-robot conversations, achieving a 70.26% winning rate.

Abstract

While LLMs excel at processing text in human conversations, they struggle with the nuances of verbal instructions in scenarios like social navigation, where ambiguity and uncertainty can erode trust in robotic and other AI systems. We address this shortcoming by moving beyond text and additionally focusing on the paralinguistic features of the audio. These features are the aspects of spoken communication that do not involve the literal wording (lexical content) but convey meaning and nuance through how something is said. We present Beyond Text: an approach that improves LLM decision-making by integrating audio transcription with a subset of these features that focus on affect and are more relevant to human-robot conversations. This approach not only achieves a 70.26% winning rate, outperforming existing LLMs by 22.16% to 48.30% (gemini-1.5-pro and gpt-3.5 respectively), but also enhances robustness against token manipulation adversarial attacks, as shown by a decrease ratio in winning rate that is 22.44% smaller than that of the text-only language model. Beyond Text marks an advancement in social robot navigation and broader human-robot interaction, seamlessly integrating text-based guidance with human-audio-informed language models.

Paper Structure (32 sections, 5 equations, 7 figures, 16 tables, 1 algorithm)

Introduction
Related Works
Our Approach
Semantic Uncertainty Quantification
Vocal Affective Cue Analysis
In-Context Prompting
Scoring Choices by Polling LLMs
New Dataset: Disfluent Navigational Instruction Audio Dataset (DNIA)
Experimental Results
Evaluation Metric: Confidence measure
Confidence Score Improvement
Choice Distribution Change
Winning Rate
Ablation Study
Robustness to Adversarial LLM Attacks
...and 17 more sections

Figures (7)

Figure 1: Current Large Language Models (LLMs) are unable to effectively interpret human vocal cues and accurately make decisions for audio-guided navigation involving ambiguous instructions.
Figure 2: Beyond Text Overview: Given a human instruction audio clip, we perform transcription and vocal affective cue analysis simultaneously. We prompt the language model to reason and generate five next-step choices. Then, we pick the action with the highest next-token logit probability. Note that with only the transcription model, the framework chooses the red choice. With the affective analysis model, the green choice is picked.
Figure 3: a) Average Confidence Score by context and audio type for samples where the language model picks the same choices as human perception. The thin bars are error bars, representing the standard deviation of the confidence score. Error bars give a general idea of how precise a measurement is or, conversely, how far from the reported value the true (error-free) value might be. b) Average Confidence Score by context and audio type. An overall low variance is indicated by the increased average confidence.
Figure 4: Distribution of top log-probability choices by context and audio type. Our work (green) shows a dynamic change from Transcription Only (Blue). Instead of blindly choosing "A" and following the instruction with uncertainty, Beyond Text chooses "B", "C", "D", or "E", which successfully identifies the uncertainty component within the audio.
Figure 5: The Adversarial Language Model Attack Pipeline. The attack paraphrases the input text to sound very certain by deleting textual uncertainty signals.
...and 2 more figures
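
The selection step described in the Figure 2 caption (generate five next-step choices, then pick the one with the highest next-token probability) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `pick_choice`, the example log-probability values, and the softmax renormalization over the five labels are all assumptions made here for clarity. The caption only states that the action with the highest next-token logit probability is picked.

```python
import math


def pick_choice(logprobs):
    """Pick the option label with the highest next-token log-probability.

    `logprobs` maps each option label (e.g. "A".."E") to the log-probability
    the language model assigns to that label as the next token.
    Returns the winning label and probabilities renormalized over the labels.
    """
    # Softmax restricted to the candidate labels; subtracting the max
    # keeps the exponentials numerically stable.
    peak = max(logprobs.values())
    weights = {label: math.exp(lp - peak) for label, lp in logprobs.items()}
    total = sum(weights.values())
    probs = {label: w / total for label, w in weights.items()}

    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical log-probabilities for the five option labels (made-up values).
example_logprobs = {"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.1, "E": -2.8}
best, probs = pick_choice(example_logprobs)
print(best, round(probs[best], 3))
```

Renormalizing over only the five labels is one possible design choice: it makes the scores comparable across prompts even when the model places some probability mass on tokens that are not option labels.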
