Can You Move These Over There? An LLM-based VR Mover for Supporting Object Manipulation

Xiangzhi Eric Wang, Zackary P. T. Sin, Ye Jia, Daniel Archer, Wynonna H. Y. Fong, Qing Li, Chen Li

TL;DR

VR Mover introduces an LLM-powered natural interface for VR object manipulation, combining pointing, speech, and memory-aware reasoning to support coarse-to-fine placement of multiple objects. The system integrates scene modelling, a user-centric augmentation pipeline, and real-time LLM-driven scene updates to produce rapid, structured API calls. In a user study, VR Mover reduced workload and arm fatigue and yielded higher usability and hedonic experience, particularly for multi-object tasks, though single-object mid-air manipulation saw limited gains. The work demonstrates practical benefits of language-enabled, context-aware interaction in VR and suggests design directions for more intuitive, efficient future interfaces. The results indicate that a natural, memory-informed LLM interface can complement traditional gizmos and virtual hands for flexible VR object manipulation.

Abstract

In our daily lives, we naturally convey instructions for the spatial manipulation of objects using words and gestures. Transposing this form of interaction to virtual reality (VR) object manipulation can be beneficial. We propose VR Mover, an LLM-empowered solution that can understand and interpret the user's vocal instructions to support object manipulation. By simply pointing and speaking, the user can have the LLM manipulate objects without structured input. Our user study demonstrates that VR Mover enhances usability, overall experience, and performance on multi-object manipulation, while also reducing workload and arm fatigue. Users prefer the proposed natural interface for broad movements and may switch to gizmos or virtual hands for finer adjustments. These findings contribute design implications for future LLM-based object manipulation interfaces, highlighting the potential for more intuitive and efficient user interactions in VR environments.

Paper Structure

This paper contains 60 sections, 21 figures, 2 tables.

Figures (21)

  • Figure 1: To move an object, (a) the user can first specify an object and its target by pointing. However, because VR Mover is aware of what the user is seeing, (b) the user can also refer to the cactus directly by speech. Here, p0 and p1 are the first and second hit points from pointing, and (c) shows a visualized hit point.
  • Figure 2: VR Mover can handle complex instructions, such as asynchronous multi-object manipulation, in which different objects receive different manipulations (e.g. different movements) and different operations are mixed (e.g. moving and rotating).
  • Figure 3: By drawing a line (lining), a user can express different manipulations. Here, we show the user using (a) a line to represent a moving vector, and (b) a line to indicate where the pictures should roughly be placed. VR Mover will determine which manipulation is being referred to, based on the user's instructions.
  • Figure 4: Empowered by an LLM, VR Mover can demonstrate intelligent responses in some instances. (a) When the user requests four chairs and a table in the middle of the room, VR Mover is aware of the environment and can place the objects in the room's center. Further, it has spatial common sense, so it knows the chairs should face the table. (b) Because VR Mover is also aware of the current context, when the user refers to the chairs, it infers that they are likely the chairs that have just been manipulated. (c) Lastly, although we did not implement an undo function, VR Mover is adaptive enough to use the provided APIs to fulfill a user's undo request.
  • Figure 5: Different interaction methods can be used to engage with VR Mover.
  • ...and 16 more figures
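
The captions for Figures 1 and 3 describe two pointing hit points, p0 and p1, and a drawn line that can be read as a moving vector. As a minimal sketch of that idea, and not the authors' implementation, the Python below treats the line from p0 to p1 as a displacement and wraps the result in a structured call. The `Vec3` type, the `move_call_from_line` helper, the `move_object` API name, and the dictionary layout are all hypothetical and chosen only for illustration; the actual API of VR Mover is not given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vec3:
    """Hypothetical 3D point/vector type, used only for this sketch."""
    x: float
    y: float
    z: float

    def __add__(self, other: "Vec3") -> "Vec3":
        return Vec3(self.x + other.x, self.y + other.y, self.z + other.z)

    def __sub__(self, other: "Vec3") -> "Vec3":
        return Vec3(self.x - other.x, self.y - other.y, self.z - other.z)


def move_call_from_line(object_id: str, position: Vec3, p0: Vec3, p1: Vec3) -> dict:
    """Read the line p0 -> p1 as a moving vector and build a structured call.

    The "move_object" name and the dict schema are assumptions, not the
    paper's API.
    """
    displacement = p1 - p0            # the drawn line, read as a vector
    target = position + displacement  # shift the object by that vector
    return {
        "api": "move_object",
        "object_id": object_id,
        "target": [target.x, target.y, target.z],
    }


# Example: the cactus sits at (1, 0, 2); the user draws a line from
# (0, 0, 0) to (2, 0, 1), so the cactus is shifted by (2, 0, 1).
call = move_call_from_line(
    "cactus", Vec3(1.0, 0.0, 2.0), Vec3(0.0, 0.0, 0.0), Vec3(2.0, 0.0, 1.0)
)
print(call)
```

In the second reading shown in Figure 3(b), the same two points would instead mark where objects should roughly be placed; according to the caption, VR Mover decides between these readings based on the user's instructions, which this sketch does not attempt to model.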