CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR
Kadir Burak Buldu, Süleyman Özdel, Ka Hei Carrie Lau, Mengdi Wang, Daniel Saad, Sofie Schönborn, Auxane Boch, Enkelejda Kasneci, Efe Bozkir
TL;DR
CUIfy addresses the lack of open-source pipelines for LLM-powered speech-based NPCs in XR by providing a modular Python server and Unity client that stream STT-LLM-TTS interactions. It supports multiple NPCs, API-based and local models, per-NPC prompts/history, and Docker deployment to reduce compatibility issues, with a streaming pipeline to minimize latency. The contributions include a complete, extendable framework, documentation, and a demonstration platform for XR applications, enabling privacy-conscious, real-time conversational experiences. The work's practical impact lies in lowering barriers to building intelligent, voice-driven XR spaces for education, training, and entertainment.
Abstract
Recent developments in computer graphics, machine learning, and sensor technologies enable numerous opportunities for extended reality (XR) setups for everyday life, from skills training to entertainment. With large corporations offering affordable consumer-grade head-mounted displays (HMDs), XR will likely become pervasive, and HMDs will develop as personal devices like smartphones and tablets. However, having intelligent spaces and naturalistic interactions in XR is as important as technological advances so that users grow their engagement in virtual and augmented spaces. To this end, large language model (LLM)--powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR. This paper provides the community with an open-source, customizable, extendable, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with widely used LLMs, STT, and TTS models. Our package also supports multiple LLM-powered NPCs per environment and minimizes latency between different computational models through streaming to achieve usable interactions between users and NPCs. We publish our source code in the following repository: https://gitlab.lrz.de/hctl/cuify
