Generative AI Framework for 3D Object Generation in Augmented Reality

Majid Behravan

TL;DR

The work tackles real-time 3D object generation in augmented reality by proposing the Matrix framework, which combines Shap-E-based text-to-3D, vision-language models (VLMs), and large language models (LLMs) with multilingual speech-to-text and text-to-speech. It introduces three subsystems—Speech-to-3D, Context-Aware Object Recommendation, and Image-to-3D—along with a pre-generated object repository to reduce GPU load and latency, achieving generation times under $50$ seconds on AR devices. Thorough quantitative and qualitative evaluations (SUS, NASA-TLX, PQ, VLM accuracy, and replication studies) demonstrate improved usability, reduced latency, consistent object outputs, and context-aware recommendations, with practical performance gains over existing methods like Dream Mesh ($30$-$40$ minutes per object). The study demonstrates broad applicability across gaming, education, retail, and interior design, emphasizes democratization of 3D content creation, and outlines future directions such as automatic placement, multimodal interaction enhancements, and multi-user AR collaboration.

Abstract

This thesis presents a framework that integrates state-of-the-art generative AI models for real-time creation of three-dimensional (3D) objects in augmented reality (AR) environments. The primary goal is to convert diverse inputs, such as images and speech, into accurate 3D models, enhancing user interaction and immersion. Key components include advanced object detection algorithms, user-friendly interaction techniques, and robust AI models like Shap-E for 3D generation. Leveraging Vision Language Models (VLMs) and Large Language Models (LLMs), the system captures spatial details from images and processes textual information to generate comprehensive 3D objects, seamlessly integrating virtual objects into real-world environments. The framework demonstrates applications across industries such as gaming, education, retail, and interior design. It allows players to create personalized in-game assets, customers to see products in their environments before purchase, and designers to convert real-world objects into 3D models for real-time visualization. A significant contribution is democratizing 3D model creation, making advanced AI tools accessible to a broader audience, fostering creativity and innovation. The framework addresses challenges like handling multilingual inputs, diverse visual data, and complex environments, improving object detection and model generation accuracy, as well as loading 3D models in AR space in real-time. In conclusion, this thesis integrates generative AI and AR for efficient 3D model generation, enhancing accessibility and paving the way for innovative applications and improved user interactions in AR environments.

Paper Structure

This paper contains 72 sections, 1 equation, 38 figures, 14 tables.

Figures (38)

  • Figure 1: Conceptual Model of the Framework for AR Interaction – This model illustrates the integration of the three main subsystems within the Matrix AR environment: Speech-to-3D, Context-Aware Object Recommendation, and Image-to-3D. The workflow demonstrates how user inputs (speech or images) are processed through ASR, LLM, and VLM to generate, recommend, and render 3D objects in real time. Text embedding supports efficient retrieval from the vector database and 3D model repository, ensuring seamless user interaction and object customization in the AR space.
  • Figure 2: The overview of the developed speech-to-3D subsystem.
  • Figure 3: The illustrative AR application example: the language selection menu.
  • Figure 4: Objects repository semantic search.
  • Figure 5: Object selection menu example in the Matrix AR application. The menu displays three categories: Detected Objects, Repository Objects, and LLM Recommended Objects. The objects shown are: Detected Objects: Apple; Repository Objects: Orange, Banana, Plate, Table, Knife; LLM Recommended Objects: Pineapple, Orange, Mango, Watermelon, Strawberry.
  • ...and 33 more figures
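
The repository lookup mentioned in the captions for Figures 1 and 4 (text embedding plus semantic search over pre-generated objects) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the framework's actual code: `toy_embed` is a character-trigram stand-in for a real text-embedding model, `ObjectRepository` and `get_model` are hypothetical helpers, the 0.6 similarity threshold is arbitrary, and the `generate` callback stands in for a text-to-3D call such as Shap-E.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a text-embedding model: character trigram counts."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ObjectRepository:
    """Hypothetical in-memory store of pre-generated 3D models keyed by name."""

    def __init__(self):
        self._items = []  # list of (name, embedding, model_path)

    def add(self, name: str, model_path: str) -> None:
        self._items.append((name, toy_embed(name), model_path))

    def search(self, query: str, threshold: float = 0.6):
        """Return (name, model_path) of the closest entry, or None if below threshold."""
        q = toy_embed(query)
        best = max(self._items, key=lambda item: cosine(q, item[1]), default=None)
        if best is None or cosine(q, best[1]) < threshold:
            return None
        return best[0], best[2]


def get_model(repo: ObjectRepository, prompt: str, generate) -> str:
    """Reuse a stored model when one is close enough; otherwise generate and cache it."""
    hit = repo.search(prompt)
    if hit is not None:
        return hit[1]
    path = generate(prompt)  # assumed to be the expensive text-to-3D step
    repo.add(prompt, path)
    return path
```

The point of the sketch is the control flow: a request is answered from the repository whenever a sufficiently similar object already exists, so the generation step runs only on a miss and its result is cached for later requests.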