RAG-VR: Leveraging Retrieval-Augmented Generation for 3D Question Answering in VR Environments
Shiyi Ding, Ying Chen
TL;DR
The paper addresses the challenge of applying large language models (LLMs) to VR, where context is highly localized and personalized. It proposes RAG-VR, a retrieval-augmented 3D QA system that runs retrieval on an edge server and combines a VR knowledge extraction pipeline, a compact retrieval-augmented database, and a two-tower retriever trained to select relevant facts. The authors report that RAG-VR improves answer accuracy by 17.9% to 41.8% and reduces end-to-end latency by 34.5% to 47.3% compared with two baselines, while generalizing across five VR scenes. The work shows that lightweight LLMs on edge hardware can support accurate, responsive VR QA, paving the way for scalable, context-aware VR applications.
Abstract
Recent advances in large language models (LLMs) provide new opportunities for context understanding in virtual reality (VR). However, VR contexts are often highly localized and personalized, limiting the effectiveness of general-purpose LLMs. To address this challenge, we present RAG-VR, the first 3D question-answering system for VR that incorporates retrieval-augmented generation (RAG), which augments an LLM with external knowledge retrieved from a localized knowledge database to improve answer quality. RAG-VR includes a pipeline for extracting comprehensive knowledge about virtual environments and user conditions for accurate answer generation. To ensure efficient retrieval, RAG-VR offloads the retrieval process to a nearby edge server and uses only essential information during retrieval. Moreover, we train the retriever to effectively distinguish among relevant, irrelevant, and hard-to-differentiate information with respect to each question. RAG-VR improves answer accuracy by 17.9%-41.8% and reduces end-to-end latency by 34.5%-47.3% compared with two baseline systems.
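To make the retrieve-then-generate flow described in the abstract concrete, the following is a minimal sketch and not the RAG-VR implementation. It assumes a toy bag-of-words embedding in place of the trained retriever, runs locally rather than on an edge server, and stops at building the LLM prompt. The names `embed`, `cosine`, `retrieve`, `build_prompt`, and the example facts in `FACTS` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter

# Hypothetical knowledge database: short facts about a virtual scene.
FACTS = [
    "The red sofa is in the living room, 2.1 meters from the user.",
    "The lamp on the desk is switched off.",
    "The blue mug is on the kitchen table.",
]


def embed(text):
    """Toy bag-of-words embedding (assumption): token -> count.

    A real system would use a trained encoder here.
    """
    for ch in "?.,":
        text = text.replace(ch, " ")
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, facts, k=2):
    """Return the k facts most similar to the question."""
    q = embed(question)
    ranked = sorted(facts, key=lambda f: cosine(q, embed(f)), reverse=True)
    return ranked[:k]


def build_prompt(question, retrieved):
    """Assemble the augmented prompt that would be passed to an LLM."""
    context = "\n".join(f"- {fact}" for fact in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    question = "Where is the blue mug?"
    print(build_prompt(question, retrieve(question, FACTS, k=1)))
```

In this sketch only the retrieved facts reach the prompt, which mirrors the idea of passing the LLM a small, localized context instead of the full scene description.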
