Table of Contents
Fetching ...

Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study

Mykola Maslych, Christian Pumarada, Amirpouya Ghasemaghaei, Joseph J. LaViola

TL;DR

The paper addresses enabling realistic and responsive LLM-powered avatars in VR by deploying a local LLM pipeline with ASR, TTS, and lip-sync. It compares an LLM-based state-machine approach to a RAG-grounded generation pipeline for task-oriented VR interactions and demonstrates both in a pilot study with three avatars and a digital twin demo. Key findings include the importance of explicit processing feedback for perceived responsiveness, the impact of latency on realism, and the potential of RAG to ground safety training content. The work offers actionable guidance on open-source toolchains, avatar design, and methodological considerations for future VR-AI systems.

Abstract

We present a virtual reality (VR) environment featuring conversational avatars powered by a locally-deployed LLM, integrated with automatic speech recognition (ASR), text-to-speech (TTS), and lip-syncing. Through a pilot study, we explored the effects of three types of avatar status indicators during response generation. Our findings reveal design considerations for improving responsiveness and realism in LLM-driven conversational systems. We also detail two system architectures: one using an LLM-based state machine to control avatar behavior and another integrating retrieval-augmented generation (RAG) for context-grounded responses. Together, these contributions offer practical insights to guide future work in developing task-oriented conversational AI in VR environments.

Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study

TL;DR

The paper addresses enabling realistic and responsive LLM-powered avatars in VR by deploying a local LLM pipeline with ASR, TTS, and lip-sync. It compares an LLM-based state-machine approach to a RAG-grounded generation pipeline for task-oriented VR interactions and demonstrates both in a pilot study with three avatars and a digital twin demo. Key findings include the importance of explicit processing feedback for perceived responsiveness, the impact of latency on realism, and the potential of RAG to ground safety training content. The work offers actionable guidance on open-source toolchains, avatar design, and methodological considerations for future VR-AI systems.

Abstract

We present a virtual reality (VR) environment featuring conversational avatars powered by a locally-deployed LLM, integrated with automatic speech recognition (ASR), text-to-speech (TTS), and lip-syncing. Through a pilot study, we explored the effects of three types of avatar status indicators during response generation. Our findings reveal design considerations for improving responsiveness and realism in LLM-driven conversational systems. We also detail two system architectures: one using an LLM-based state machine to control avatar behavior and another integrating retrieval-augmented generation (RAG) for context-grounded responses. Together, these contributions offer practical insights to guide future work in developing task-oriented conversational AI in VR environments.
Paper Structure (18 sections, 3 figures, 1 table)

This paper contains 18 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Pipeline for generating responses to user's queries. Left -- architecture overview: ASR transcribes user's voice, passing it to Conversation Handler, which uses an LLM to generate a text response that gets voiced by Edge-TTS. Right -- Conversation Handler: state management system for controlling agent's behavior. Each state contains agent behavior that gets appended as a system message upon a transition to that state; states with outgoing transitions also contain transition conditions and few-shot examples of transition decisions. Transitions are decided by an LLM, which is instructed to return "transition" / "no transition" responses through system prompts with the last few messages from user-avatar history inserted in-between.
  • Figure 2: Pilot study results: (a) survey responses about avatar realism and responsiveness; (b) preferred wait feedback types; (c) number of conversational turns required to complete the in-VR scenario n-th time; (d) participant's head gaze deviation angle (from directly looking at the avatar's face) during n-th scenario completion.
  • Figure 3: Pipeline for the RAG-enhanced system architecture for answering user's queries about a specific application and machine. After user's speech is transcribed with ASR, alternative formulations of their query are used to retrieve closest matches of text chunks from a machine's manual. This additional context is provided to the LLM as an appended system message.