Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study
Mykola Maslych, Christian Pumarada, Amirpouya Ghasemaghaei, Joseph J. LaViola
TL;DR
The paper examines how to build realistic, responsive LLM-powered avatars in VR using a locally deployed LLM pipeline with ASR, TTS, and lip-sync. It compares an LLM-based state-machine approach with a RAG-grounded generation pipeline for task-oriented VR interactions, and demonstrates both in a pilot study with three avatars and a digital twin demo. Key findings include the importance of explicit processing feedback for perceived responsiveness, the impact of latency on realism, and the potential of RAG to ground safety training content. The work also offers practical guidance on open-source toolchains, avatar design, and methodological considerations for future VR-AI systems.
Abstract
We present a virtual reality (VR) environment featuring conversational avatars powered by a locally deployed LLM, integrated with automatic speech recognition (ASR), text-to-speech (TTS), and lip-syncing. Through a pilot study, we explored the effects of three types of avatar status indicators during response generation. Our findings reveal design considerations for improving responsiveness and realism in LLM-driven conversational systems. We also detail two system architectures: one using an LLM-based state machine to control avatar behavior and another integrating retrieval-augmented generation (RAG) for context-grounded responses. Together, these contributions offer practical insights to guide future work in developing task-oriented conversational AI in VR environments.
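
To make the first architecture more concrete, the following is a minimal sketch of how an LLM-based state machine could drive an avatar's task flow. It is an illustration under stated assumptions, not the authors' implementation: the state names, the transition labels, and the `fake_llm` stand-in (which replaces a call to a locally deployed model) are all hypothetical. The idea shown is that the LLM only chooses among the transitions allowed in the current state, and any unrecognized output leaves the state unchanged.

```python
from typing import Callable, Dict

# Hypothetical task states and the transition labels allowed in each one.
# These names are illustrative assumptions, not taken from the paper.
TRANSITIONS: Dict[str, Dict[str, str]] = {
    "greeting": {"start_task": "instructing"},
    "instructing": {"ask_repeat": "instructing", "confirm_done": "finished"},
    "finished": {},
}


def next_state(state: str, utterance: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM to pick one allowed transition label for the current state."""
    options = TRANSITIONS[state]
    if not options:
        return state  # terminal state: nothing to choose
    prompt = (
        f"Current state: {state}\n"
        f"User said: {utterance}\n"
        f"Answer with exactly one of: {', '.join(options)}"
    )
    label = llm(prompt).strip().lower()
    # Guard against free-form output: ignore labels that are not allowed here.
    return options.get(label, state)


def fake_llm(prompt: str) -> str:
    """Stand-in for a local LLM call (assumption): keyword match on the user line."""
    user_line = prompt.split("\n")[1].lower()
    if "ready" in user_line:
        return "start_task"
    if "done" in user_line:
        return "confirm_done"
    return "unknown"


if __name__ == "__main__":
    state = "greeting"
    for said in ["I'm ready", "what was that?", "all done"]:
        state = next_state(state, said, fake_llm)
        print(said, "->", state)
```

Constraining the model to a fixed set of labels keeps avatar behavior predictable even when the generated text is not, which is one plausible reason to prefer a state machine for task-oriented interactions.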
