From Voices to Worlds: Developing an AI-Powered Framework for 3D Object Generation in Augmented Reality
Majid Behravan, Denis Gracanin
TL;DR
This work presents Matrix, an open-source, AI-powered framework for real-time, speech-driven 3D object generation in AR. It integrates multilingual speech-to-text, a text-to-3D generator, and LLM-based semantic understanding to deliver context-aware object creation and recommendations with reduced latency. Key contributions include a memory-efficient mesh simplification pipeline, a pre-generated object repository with vector-based retrieval, and a modular architecture suitable for local deployment on devices such as HoloLens 2. The results demonstrate that real-time AR generation is feasible, with substantial reductions in mesh size and improved responsiveness. Potential applications include education, design, and accessibility, and the paper outlines future directions including image-to-3D conversion and multimodal AR reasoning.
Abstract
This paper presents Matrix, an AI-powered framework for real-time 3D object generation in Augmented Reality (AR) environments. By integrating a text-to-3D generative AI model, multilingual speech-to-text translation, and large language models (LLMs), the system lets users interact through spoken commands. The framework processes speech input, generates 3D objects, and recommends objects based on contextual understanding. A key feature is its ability to optimize 3D models by reducing mesh complexity, which yields significantly smaller file sizes and faster processing on resource-constrained AR devices. Our approach addresses the challenges of high GPU usage, large model output sizes, and real-time system responsiveness. The system also includes a pre-generated object repository, so that previously generated objects can be reused, further reducing GPU load. We demonstrate practical applications of the framework in education, design, and accessibility, and discuss future enhancements including image-to-3D conversion, environmental object detection, and multimodal support. The framework is open source, which supports ongoing development and adoption across industries.
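As an illustration of how a pre-generated object repository can reduce GPU load, the following minimal sketch looks up a stored object by embedding similarity before falling back to generation. It is not the Matrix implementation: the function names, the toy three-dimensional embeddings, the file names, and the 0.85 similarity threshold are all assumptions made for this example.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_cached_object(
    query_vec: Sequence[float],
    repository: List[Tuple[Sequence[float], str]],
    threshold: float = 0.85,  # hypothetical cutoff, not taken from the paper
) -> Optional[str]:
    """Return the path of the most similar stored object, or None when no
    entry reaches the threshold (the caller would then run text-to-3D)."""
    best_path, best_score = None, threshold
    for vec, path in repository:
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_path, best_score = path, score
    return best_path


# Toy, hand-made embeddings standing in for real text embeddings (assumption).
repository = [
    ([0.9, 0.1, 0.0], "chair.glb"),
    ([0.0, 0.2, 0.9], "lamp.glb"),
]

hit = find_cached_object([1.0, 0.1, 0.0], repository)   # close to the chair entry
miss = find_cached_object([0.5, 0.5, 0.5], repository)  # close to nothing stored
```

In this sketch a hit returns a stored mesh path without invoking the generator, while a miss returns `None` so the caller can fall back to text-to-3D generation and, optionally, add the result to the repository.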
