LLMR: Real-time Prompting of Interactive Worlds using Large Language Models

Fernanda De La Torre, Cathy Mengying Fang, Han Huang, Andrzej Banburski-Fahey, Judith Amores Fernandez, Jaron Lanier

TL;DR

Large Language Model for Mixed Reality leverages novel strategies to tackle difficult cases where ideal training data is scarce, or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity.

Abstract

We present Large Language Model for Mixed Reality (LLMR), a framework for the real-time creation and modification of interactive Mixed Reality experiences using LLMs. LLMR leverages novel strategies to tackle difficult cases where ideal training data is scarce, or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity. Our framework relies on text interaction and the Unity game engine. By incorporating techniques for scene understanding, task planning, self-debugging, and memory management, LLMR achieves an average error rate 4x lower than that of standard GPT-4. We demonstrate LLMR's cross-platform interoperability with several example worlds, and evaluate it on a variety of creation and modification tasks to show that it can produce and edit diverse objects, tools, and scenes. Finally, we conducted a usability study (N=11) with a diverse set of participants, which revealed that they had positive experiences with the system and would use it again.

Paper Structure (56 sections, 6 equations, 17 figures, 5 tables, 3 algorithms)

Figures (17)

  • Figure 1: Large Language Model for Mixed Reality (LLMR) architecture for real-time interactive 3D scene generation. Starting from the left, a user prompt and the existing 3D scene ($\Omega$) are fed into the Planner (P) and Scene Analyzer (SA) modules, respectively. The Planner decomposes the user prompt into a sequence of sub-prompts, while the SA summarizes the current scene elements. These are then integrated with a Skill Library (SL) to guide the Builder (B) module, which generates the appropriate code. The Inspector (I) module iteratively checks the generated code for compilation and run-time errors. Upon receiving the green light from the Inspector, the code is compiled using the Roslyn Compiler and executed in the Unity Engine to produce the desired 3D scene and functionalities as specified by the user.
  • Figure 2: The Planner and its role in breaking down a user's high-level request into a sequence of manageable subtasks $(u_1, u_2, \ldots, u_n)$. The Planner engages in a user-oriented conversation to determine the appropriate scope and granularity of each subtask. Following this, the Builder executes the plan by generating code $(x_1, x_2, \ldots, x_n)$ for each subtask, effectively carrying out the user's initial request.
  • Figure 3: Scene Analyzer module. The virtual scene, depicted in the bottom-left corner, is converted into a parsed scene hierarchy in JSON format. This, along with the user request, serves as input to the Scene Analyzer. The output is a filtered, relevant summary of the scene, which is then used for conditioning subsequent modules like the Builder. The process optimizes the utilization of the language model's fixed context window and enhances focus on objects relevant to the user prompt.
  • Figure 4: Builder-Inspector paradigm in LLMR. The Builder module $\mathrm{B}(x | u, s)$ generates code based on user input and current state. The generated code is then inspected by the Inspector module $\mathrm{I}(r,v|x,s)$ for compilation and run-time errors. If errors are found, indicated by verdict $v$, the Inspector provides suggestions $r$ for corrections. The process iterates until either the code passes inspection or a maximum number of inspections $T$ is reached. This feedback loop significantly enhances the quality of the generated scripts.
  • Figure 5: Skill Library module workflow. On the left, the module receives inputs from the Scene Analyzer and a user prompt "create a whale and make it swim happily". A list of skills is provided to the SL GPT module in its metaprompt, which also contains a high-level summary of available skills such as object retrieval and animation. The module then identifies and outputs the most relevant skills (in this case, object retriever and animation) to the Builder, which subsequently utilizes these tools for implementation.
  • ...and 12 more figures
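
The Builder-Inspector loop described in the Figure 4 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Verdict`, `build_with_inspection`, `toy_builder`, and `toy_inspector` are hypothetical, the two toy functions stand in for LLM calls, and feeding the Inspector's suggestions back into the next Builder call is an assumption about how the feedback is used. Only the symbols $u$, $s$, $x$, $r$, $v$, and $T$ come from the caption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    passed: bool      # v in the caption: did the code pass inspection?
    suggestions: str  # r in the caption: proposed corrections, if any


# Hypothetical signatures standing in for the LLM-backed modules.
Builder = Callable[[str, str, str], str]   # (u, s, feedback) -> code x
Inspector = Callable[[str, str], Verdict]  # (x, s) -> (v, r)


def build_with_inspection(
    builder: Builder,
    inspector: Inspector,
    user_prompt: str,      # u
    scene_summary: str,    # s
    max_inspections: int,  # T
) -> Tuple[str, bool, int]:
    """Generate code, inspect it, and retry until it passes or T is reached.

    Returns (last generated code, whether it passed, inspections used).
    """
    if max_inspections < 1:
        raise ValueError("max_inspections must be at least 1")
    feedback = ""
    code = ""
    for attempt in range(1, max_inspections + 1):
        code = builder(user_prompt, scene_summary, feedback)
        verdict = inspector(code, scene_summary)
        if verdict.passed:
            return code, True, attempt
        # Assumption: suggestions condition the next generation attempt.
        feedback = verdict.suggestions
    return code, False, max_inspections


def toy_builder(u: str, s: str, feedback: str) -> str:
    """Toy stand-in: forgets a semicolon until told about it."""
    code = f'// task: {u}\nvar whale = GameObject.Find("Whale")'
    if "semicolon" in feedback:
        code += ";"
    return code


def toy_inspector(x: str, s: str) -> Verdict:
    """Toy stand-in: the only 'error' it knows is a missing semicolon."""
    if x.rstrip().endswith(";"):
        return Verdict(True, "")
    return Verdict(False, "add the missing semicolon")


if __name__ == "__main__":
    code, ok, used = build_with_inspection(
        toy_builder, toy_inspector, "make the whale swim", "Whale at origin", 3
    )
    print(ok, used)
    print(code)
```

With the toy functions, the first attempt fails inspection and the second passes. In a real system the cap $T$ bounds the cost of the loop when the Inspector keeps rejecting the code, so the caller has to decide what to do with code that never passed.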