Human Latency Conversational Turns for Spoken Avatar Systems

Derek Jacoby; Tianyi Zhang; Aanchan Mohan; Yvonne Coady

TL;DR

This paper addresses latency in LLM-driven spoken avatars by modeling real-time dialogue in which the system may begin responding before the speaker finishes an utterance. Questions from the Google NaturalQuestions (NQ) dataset are truncated by dropping their final words to simulate turns whose endings may or may not be informative. GPT-4 serves as the responder, and its responses are scored with SEMSCORE against ground-truth references. Results show that GPT-4 can effectively fill in the missing context from a dropped word in over 60% of cases, suggesting that filler-phrase strategies could maintain human-like turn-taking with acceptable quality loss. The authors envision a classifier that decides when to deploy fillers, and aim to advance avatar dialogue in museum settings through in-domain datasets and real-time processing capabilities.

Abstract

A problem with many current Large Language Model (LLM) driven spoken dialogue systems is response time. Some efforts, such as Groq, address this issue through very fast LLM processing, but the cognitive psychology literature shows that in human-to-human dialogue responses often begin before the speaker has completed their utterance. No amount of LLM processing delay is therefore acceptable if we wish to maintain human dialogue latencies. In this paper, we discuss methods for understanding an utterance in close to real time and generating a response so that the system can comply with human-level conversational turn delays. This means that the information content of the final part of the speaker's utterance is lost to the LLM. Using the Google NaturalQuestions (NQ) database, our results show that GPT-4 can effectively fill in missing context from a dropped word at the end of a question over 60% of the time. We also provide examples of utterances and of the impact of this information loss on the quality of the LLM response, in the context of an avatar that is currently under development. These results indicate that a simple classifier could be used to determine whether a question is semantically complete or requires a filler phrase to allow a response to be generated within human dialogue time constraints.
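The truncation step described in the abstract, dropping words from the end of a question before it reaches the LLM, can be illustrated with a minimal Python sketch. The function name `truncate_final_words`, the whitespace-based word splitting, and the example question are assumptions made for illustration; the summary does not specify how the authors tokenized or truncated the NQ questions.

```python
def truncate_final_words(utterance: str, n_dropped: int = 1) -> str:
    """Return the utterance with its last n_dropped whitespace-separated words removed.

    Hypothetical helper: it mimics the information lost when a system starts
    responding before the speaker has finished the utterance.
    """
    if n_dropped < 0:
        raise ValueError("n_dropped must be non-negative")
    words = utterance.split()
    # Keep everything except the final n_dropped words (never a negative count).
    kept = words[: max(len(words) - n_dropped, 0)]
    return " ".join(kept)


# Hypothetical example question, not taken from the paper's data.
question = "who wrote the book of love"
for n in range(3):
    print(n, "->", truncate_final_words(question, n))
```

Each truncated variant would then be sent to the responder in place of the full question, and the resulting response compared against the response to the untruncated question.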

Paper Structure (13 sections, 3 figures)

Introduction
Background and Related Work
Assessing machine dialogues
Latencies in Human-to-Human Dialogue
Spoken Dialogue Theories
Latencies in Machine Dialogues
Methodology and Experimental Framework
Google NaturalQuestions Dataset
Responses from Large Language Models
Scoring similarity between responses
Results
Discussion
Conclusions and Future Work

Figures (3)

Figure 1: Histogram of SEMSCORES for LLM-returned responses against the ground-truth reference responses
Figure 2: Box-and-whisker plots of SEMSCORES for the similarity of 'res-0' vs. truncated utterances, restricted to utterances whose 'res-0' score against the ground truth was above the 75th-percentile score of 0.81
Figure 3: Number of questions (out of 1000) where the response was rated within the 75th percentile of untruncated responses
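
The count reported in Figure 3 can be read as a simple thresholding operation over per-question similarity scores. The sketch below assumes that "within the 75th percentile" means a score at or above the 0.81 threshold quoted in the Figure 2 caption; that reading, the function name `count_within_threshold`, and the toy scores are assumptions for illustration and are not results from the paper.

```python
from typing import Sequence


def count_within_threshold(scores: Sequence[float], threshold: float) -> int:
    """Count how many similarity scores are at or above the given threshold."""
    return sum(1 for score in scores if score >= threshold)


# Toy scores for illustration only; these are not values from the paper.
toy_scores = [0.92, 0.78, 0.85, 0.40]

# 0.81 is the 75th-percentile score quoted in the Figure 2 caption.
print(count_within_threshold(toy_scores, threshold=0.81))
```

Running the same count separately for each truncation level would give one bar per level, which is one plausible way to arrive at a chart like Figure 3.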
