Analyzing Multimodal Interaction Strategies for LLM-Assisted Manipulation of 3D Scenes

Junlong Chen, Jens Grubert, Per Ola Kristensson

TL;DR

Through an empirical study, the authors demonstrate that LLM-assisted interactive systems can be used productively in immersive environments, identify opportunities for improving natural language interfaces in 3D design tools, and propose design recommendations.

Abstract

As more applications of large language models (LLMs) for 3D content for immersive environments emerge, it is crucial to study user behaviour to identify interaction patterns and potential barriers to guide the future design of immersive content creation and editing systems which involve LLMs. In an empirical user study with 12 participants, we combine quantitative usage data with post-experience questionnaire feedback to reveal common interaction patterns and key barriers in LLM-assisted 3D scene editing systems. We identify opportunities for improving natural language interfaces in 3D design tools and propose design recommendations for future LLM-integrated 3D content creation systems. Through an empirical study, we demonstrate that LLM-assisted interactive systems can be used productively in immersive environments.

Paper Structure

This paper contains 37 sections, 5 figures.

Figures (5)

  • Figure 1: Workflow of the AssistVR system designed for the study. In the training phase, only Azure Conversational Language Understanding (CLU) is involved. The developer labels a number of utterances with intents and entities, and finetunes the Azure CLU model. The model is iteratively improved based on performance metrics. In the deployment phase, Azure CLU classifies user speech input into different intents. If the intent falls under the 'Select', 'Deselect', 'Modify', or 'Undo' categories, further post-processing steps to modify the scene are conducted in Unity. If the intent does not fall under these categories, the user speech input and a text file containing the instructions prompt and scene graph of the current scene are sent to GPT-4o, which generates a natural language response synthesized into speech for the user.
  • Figure 2: Example of the draggable panel. The panel shows that the current object hit by the raycast is the 'Sofa' with 'Blue' color and 'Cotton' material. Currently, objects 'Table', 'Vase', and 'Bed' are selected. The user says, "Make them purple." The system responds, "Changing Table, Vase, Bed into purple," and modifies the color of the selected objects. At the bottom of the panel, screenshots of the target scene at different angles are shown.
  • Figure 3: Number of remaining elemental editing steps to match the target scene in Task 1 (top figure) and Task 2 (bottom figure). The horizontal axis denotes relative time in minutes and seconds.
  • Figure 4: Interaction strategies adopted by different users across the duration of Task 1 (top left) and Task 2 (top right). Time on the horizontal axis is displayed in the format MM:SS (minutes:seconds). Triangles above the timeline of each participant indicate user queries. The main timeline bar for each participant indicates the high-level strategy employed (IE: Incremental Exploration, or BM: Bulk Modification). The secondary timeline bar below the main timeline bar indicates the low-level strategy employed, namely Bulk Modify Color (BM-Color), Bulk Modify Material (BM-Material), Color Editing with Incremental Exploration (IE-Color), Material Editing with Incremental Exploration (IE-Material), or Carpet Editing with Incremental Exploration (IE-Carpet). Grey tags below the timeline bars represent the current scene status (O: Original scene. T: Target scene. P: Partially-edited scene.) The target scene can be the grey scene in the teaser figure (right), or the purple scene shown here. Each participant experienced both target scenes, and the order of the target scenes in Task 1 and Task 2 is counterbalanced across participants. Among partially-edited scenes, some occur frequently and are labelled explicitly, together with their number of remaining elemental editing steps (REES). These include: T*: In addition to T, one of the walls received an extra edit, REES=1. T**: In addition to T, the carpet material is incorrect, REES=1. A: Color/material changed for all objects except the carpet, REES=1. A*: In addition to A, one of the walls received an extra edit, REES=2. Example screenshots of these scenes are provided below the timeline.
  • Figure 5: Box plots of the time spent in minutes for all participants on the incremental exploration (IE) strategy and the bulk modification (BM) strategy (left), the percentage of time spent on both strategies for all participants in Task 1 and Task 2 (middle), and the number of queries posed during both strategies for all participants in Task 1 and Task 2 (right). Black squares indicate the mean values.
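
The deployment-phase routing described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: only the four intent names ('Select', 'Deselect', 'Modify', 'Undo') come from the caption. The names `route_utterance`, `toy_classifier`, and `Routed`, the prompt layout, and the keyword rules are all hypothetical stand-ins, and no real Azure CLU or GPT-4o call is made.

```python
from dataclasses import dataclass
from typing import Callable

# Intent names taken from the Figure 1 caption; everything else is hypothetical.
SCENE_INTENTS = {"Select", "Deselect", "Modify", "Undo"}


@dataclass
class Routed:
    target: str   # "scene" (handled by scene post-processing) or "llm"
    payload: str  # the intent name, or the prompt text for the LLM


def toy_classifier(utterance: str) -> str:
    """Keyword stub standing in for the intent classifier (assumption)."""
    text = utterance.lower()
    if "undo" in text:
        return "Undo"
    if "deselect" in text:  # checked before "select", which it contains
        return "Deselect"
    if "select" in text:
        return "Select"
    if "make" in text or "change" in text:
        return "Modify"
    return "None"


def route_utterance(
    utterance: str,
    classify_intent: Callable[[str], str],
    scene_graph_text: str,
    instructions: str,
) -> Routed:
    """Send scene-editing intents to the scene; send everything else to the LLM."""
    intent = classify_intent(utterance)
    if intent in SCENE_INTENTS:
        return Routed("scene", intent)
    # Any other intent: bundle instructions, scene graph, and the user's words.
    prompt = (
        f"{instructions}\n\n"
        f"Scene graph:\n{scene_graph_text}\n\n"
        f"User: {utterance}"
    )
    return Routed("llm", prompt)
```

Under these assumptions, an utterance such as "Make them purple" would be routed to the scene with the 'Modify' intent, while a question about the scene would be forwarded to the language model together with the scene graph text.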