Reality Copilot: Voice-First Human-AI Collaboration in Mixed Reality Using Large Multimodal Models

Liuchuan Yu; Yongqi Zhang; Lap-Fai Yu

TL;DR

Reality Copilot addresses the need for natural voice-based collaboration in mixed reality by integrating commercial and open-source Large Multimodal Models into a privacy-preserving MR assistant. The system uses a stack-based context processing framework to convert voice input into voice responses and concrete actions, enabling tasks like real-time guidance, egocentric video narration, and 3D model generation. It demonstrates end-to-end implementation on Meta Quest 3 with on-device recording and local LMM processing, plus email export of assets and cross-platform content generation. The work contributes a practical blueprint for LMM-powered MR interaction and suggests new directions for immersive, multimodal human-AI collaboration.

Abstract

Large Multimodal Models (LMMs) have shown strong potential for assisting users in tasks such as programming, content creation, and information access, yet interaction with them remains largely limited to traditional interfaces such as desktops and smartphones. Meanwhile, advances in mixed reality (MR) hardware have enabled applications that extend beyond entertainment into everyday use. However, most existing MR systems rely primarily on manual input (e.g., hand gestures or controllers) and provide limited intelligent assistance because they are not integrated with large-scale AI models. We present Reality Copilot, a voice-first human-AI assistant for mixed reality that leverages LMMs to enable natural speech-based interaction. The system supports contextual understanding of physical environments, realistic 3D content generation, and real-time information retrieval. In addition to in-headset interaction, Reality Copilot facilitates cross-platform workflows by generating context-aware textual content and exporting generated assets. This work explores the design space of LMM-powered human-AI collaboration in mixed reality.

Paper Structure (11 sections, 3 figures)

This paper contains 11 sections, 3 figures.

Introduction
Related Work
Human-AI Collaboration in Mixed Reality
Large Multimodal Model Applications
Overview
Implementation
Applications
Real-Time Assistant
Egocentric Video Creation
3D Modeling Workflow Integration
Conclusion

Figures (3)

Figure 1: Workflow of Reality Copilot. When a user wears a mixed reality headset and launches Reality Copilot, they can interact using natural voice. The voice input, along with contextual information (e.g., user interface state and service availability), is sent to the LMMs. Reality Copilot integrates two types of LMMs: commercial and open-source. Voice inputs are processed by commercial LMMs, while image and 3D model processing is handled by open-source LMMs. The output consists of both voice responses and system actions, which are also used to update the internal context.
Figure 2: System internals of Reality Copilot: (a) Stack-based context processing; (b) Hardware-accelerated recording pipeline.
Figure 3: Application samples of Reality Copilot. In each subfigure, the left-side image shows a third-person view captured with an iPhone, while the right-side image presents the corresponding egocentric (user) view. (A) Reality Copilot, powered by LMMs, assists students in learning electronics by providing real-time, voice-driven guidance. (B) Reality Copilot supports egocentric video creation with real-time narration, aiding content creators in documenting handcrafting processes. (C) Reality Copilot enhances the 3D modeling workflow by enabling designers to generate realistic 3D models.
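
The loop described in the Figure 1 and Figure 2(a) captions (voice input plus context goes to an LMM, which returns a voice response and system actions that in turn update the context) can be illustrated with a small Python sketch. This summary does not specify how the stack is structured or what the actions look like, so everything here is an assumption made for illustration: the `Frame`, `Reply`, and `ContextStack` classes, the `push`/`pop`/`set` action names, and the keyword-matching `fake_model` that stands in for a real LMM call. It should not be read as the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One context frame, e.g. the UI panel currently in focus (hypothetical)."""
    name: str
    state: dict = field(default_factory=dict)


@dataclass
class Reply:
    """What the model returns: text to speak plus (op, arg) actions (hypothetical)."""
    speech: str
    actions: list


class ContextStack:
    """Assumed stack of frames; the top frame is the current interaction context."""

    def __init__(self):
        self.frames = [Frame("root")]

    def snapshot(self):
        # Context sent along with the voice input, ordered bottom to top.
        return [{"name": f.name, "state": dict(f.state)} for f in self.frames]

    def apply(self, actions):
        # System actions returned by the model update the internal context.
        for op, arg in actions:
            if op == "push":
                self.frames.append(Frame(arg))
            elif op == "pop" and len(self.frames) > 1:
                self.frames.pop()
            elif op == "set":
                key, value = arg
                self.frames[-1].state[key] = value


def handle_turn(stack, utterance, model):
    """One voice turn: send utterance and context, apply actions, return speech."""
    reply = model(utterance, stack.snapshot())
    stack.apply(reply.actions)
    return reply.speech


def fake_model(utterance, context):
    """Keyword-matching stand-in for an LMM call; not a real model or API."""
    top = context[-1]["name"]
    if "stop" in utterance:
        if top == "recording":
            return Reply("Recording saved.", [("pop", None)])
        return Reply("Nothing is recording.", [])
    if "record" in utterance:
        return Reply("Recording started.",
                     [("push", "recording"), ("set", ("active", True))])
    return Reply("Sorry, I did not catch that.", [])
```

Calling `handle_turn(stack, "start recording", fake_model)` on a fresh `ContextStack` pushes a `recording` frame, and a later "stop recording" pops it again. The point of the stack in this sketch is that the same utterance can be interpreted differently depending on which frame is on top.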
