Beyond Text: Utilizing Vocal Cues to Improve Decision Making in LLMs for Robot Navigation Tasks

Xingpeng Sun, Haoming Meng, Souradip Chakraborty, Amrit Singh Bedi, Aniket Bera

TL;DR

Beyond Text is an approach that improves LLM decision-making by combining audio transcription with a subset of paralinguistic features, namely those that capture affect and are most relevant to human-robot conversations. It achieves a 70.26% winning rate.

Abstract

While LLMs excel at processing text in human conversations, they struggle with the nuances of verbal instructions in scenarios like social navigation, where ambiguity and uncertainty can erode trust in robotic and other AI systems. We can address this shortcoming by moving beyond text and additionally focusing on the paralinguistic features of these audio responses. These features are the aspects of spoken communication that do not involve the literal wording (lexical content) but convey meaning and nuance through how something is said. We present Beyond Text: an approach that improves LLM decision-making by integrating audio transcription with a subset of these features, namely those that capture affect and are most relevant to human-robot conversations. This approach not only achieves a 70.26% winning rate, outperforming existing LLMs by 22.16% to 48.30% (gemini-1.5-pro and gpt-3.5 respectively), but also enhances robustness against token manipulation adversarial attacks, as shown by a decrease ratio in winning rate that is 22.44% smaller than that of the text-only language model. Beyond Text marks an advancement in social robot navigation and broader human-robot interaction, seamlessly integrating text-based guidance with human-audio-informed language models.

Paper Structure (32 sections, 5 equations, 7 figures, 16 tables, 1 algorithm)

Figures (7)

  • Figure 1: Current Large Language Models (LLMs) are unable to effectively interpret human vocal cues and accurately make decisions for audio-guided navigation involving ambiguous instructions.
  • Figure 2: Beyond Text Overview: Given an audio clip of a human instruction, we perform transcription and vocal affective cue analysis simultaneously. We prompt the language model to reason and generate five next-step choices, then pick the action with the highest next-token log probability. Note that with only the transcription model, the framework chooses the red choice; with the affective analysis model, it picks the green choice.
  • Figure 3: a) Average Confidence Score by context and audio type for samples where the language model picks the same choices as human perception. The skinny bar is the error bar, representing the standard deviation of the confidence score. Error bars give a general idea of how precise a measurement is or, conversely, how far from the reported value the true (error-free) value might be. b) Average Confidence Score by context and audio type. An overall low variance is indicated by the increased average confidence.
  • Figure 4: Distribution of top log-probability choices by context and audio type. Our work (green) shows a marked shift from Transcription Only (blue). Instead of blindly choosing "A" and following the instruction despite its uncertainty, Beyond Text chooses "B", "C", "D", or "E", successfully identifying the uncertainty component within the audio.
  • Figure 5: The Adversarial Language Model Attack Pipeline. The attack paraphrases the input text to sound very certain by deleting textual uncertainty signals.
  • ...and 2 more figures
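
The selection step described in the Figure 2 caption (pick the action with the highest next-token log probability among five choices) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `pick_choice`, the input dictionary format, and the confidence value (the winning choice's probability renormalised over the five letters) are all assumptions made for this example, and the log-probability numbers are invented.

```python
import math

# The five next-step choices the language model is prompted to generate.
CHOICES = ("A", "B", "C", "D", "E")


def pick_choice(token_logprobs):
    """Return (choice, confidence) from next-token log-probabilities.

    token_logprobs: dict mapping a candidate next token to its log-probability
    (hypothetical input format; letters missing from the dict count as -inf).
    """
    scores = {c: token_logprobs.get(c, float("-inf")) for c in CHOICES}
    best = max(scores, key=scores.get)
    top = scores[best]
    if top == float("-inf"):
        raise ValueError("no log-probability given for any of the choices")
    # Assumed confidence definition: softmax over the five letters only,
    # computed relative to the top score for numerical stability.
    total = sum(math.exp(s - top) for s in scores.values())
    return best, 1.0 / total


# Invented numbers: a transcription-only prompt that favours "A" ...
text_only = {"A": -0.2, "B": -2.0, "C": -3.5, "D": -4.0, "E": -5.0}
# ... and a prompt that also carries affective cues and favours "C".
with_affect = {"A": -2.5, "B": -1.8, "C": -0.4, "D": -3.0, "E": -4.0}

print(pick_choice(text_only)[0])
print(pick_choice(with_affect)[0])
```

Restricting the comparison to the five answer letters keeps unrelated tokens in the vocabulary from influencing the choice, which is one plausible reading of the caption.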