SIAgent: Spatial Interaction Agent via LLM-powered Eye-Hand Motion Intent Understanding in VR

Zhimin Wang; Chenyu Gu; Feng Lu

TL;DR

SIAgent is a novel "Intent-to-Operation" framework that lets users express interaction intents through natural eye-hand motions based on common sense and habits. The work offers insights into enhancing VR interaction intelligence through intent-driven design.

Abstract

Eye-hand coordinated interaction is becoming a mainstream interaction modality in Virtual Reality (VR) user interfaces. Current paradigms for this multimodal interaction require users to learn predefined gestures and memorize multiple gesture-task associations, which can be summarized as an "Operation-to-Intent" paradigm. This paradigm increases users' learning costs and has low interaction error tolerance. In this paper, we propose SIAgent, a novel "Intent-to-Operation" framework allowing users to express interaction intents through natural eye-hand motions based on common sense and habits. Our system features two main components: (1) intent recognition, which translates spatial interaction data into natural language and infers user intent, and (2) agent-based execution, which generates an agent to execute the corresponding tasks. This eliminates the need for gesture memorization and accommodates individual motion preferences with high error tolerance. We conduct two user studies across over 60 interaction tasks, comparing our method with two "Operation-to-Intent" techniques. Results show our method achieves higher intent recognition accuracy than gaze + pinch interaction (97.2% vs 93.1%) while reducing arm fatigue and improving usability and user preference. Another study verifies the role of the eye gaze and hand motion channels in intent recognition. Our work offers valuable insights into enhancing VR interaction intelligence through intent-driven design. Our source code and LLM prompts will be made available upon publication.
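
To make the first component more concrete, the following is a minimal Python sketch of what translating spatial interaction data into natural language might look like. It is not the authors' implementation, since the source code and prompts are not yet released. The `MotionSample` fields, the `describe` and `build_prompt` helpers, and the prompt wording are all hypothetical; only the five hand-type labels are taken from the paper's Figure 4 caption.

```python
from dataclasses import dataclass

# Hand-shape labels borrowed from the Figure 4 caption. Everything else in
# this sketch (field names, sentence template, prompt text) is an assumption.
HAND_TYPES = ("open", "half-grip", "tight-grip", "tip pinch", "index tap")


@dataclass
class MotionSample:
    gazed_object: str    # hypothetical: name of the object the gaze ray hits
    hand_type: str       # one of HAND_TYPES
    hand_direction: str  # hypothetical coarse label, e.g. "upward"


def describe(sample: MotionSample) -> str:
    """Turn one eye-hand sample into a plain-English sentence."""
    if sample.hand_type not in HAND_TYPES:
        raise ValueError(f"unknown hand type: {sample.hand_type!r}")
    return (
        f"The user is looking at the {sample.gazed_object} while the hand, "
        f"in a {sample.hand_type} shape, moves {sample.hand_direction}."
    )


def build_prompt(samples: list[MotionSample]) -> str:
    """Assemble a text prompt asking an LLM for candidate intents."""
    lines = [describe(s) for s in samples]
    return (
        "Observed motion:\n"
        + "\n".join(lines)
        + "\nList the most likely user intents."
    )
```

In this sketch the resulting string would be sent to an LLM, whose candidate intents are then confirmed by the user before any agent acts on them. How the actual system encodes gaze and hand data may differ substantially.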

Paper Structure (29 sections, 5 equations, 9 figures, 3 tables)

Introduction
Related Works
Single-Modal Natural Interaction in VR
Multimodal Interaction in VR
Applications of AI and LLMs in VR
Motivations and Challenges
Design of SIAgent
Spatial-to-Linguistic Translation
Interaction Intent Recognition
Agent-based Execution
System Implementation and Apparatus
Evaluation Design
Research Objectives
Participants
Task Design and Scenarios
...and 14 more sections

Figures (9)

Figure 1: The conventional Operation-to-Intent paradigm (left) requires users to perform sequential operations such as gaze-based pointing, gesture confirmation, and manipulation, resulting in high learning costs and low error tolerance. In contrast, our proposed Intent-to-Operation paradigm (right) allows users to naturally express intent through eye-hand motion, which is then recognized and executed by an agent powered by LLMs, enabling intuitive, flexible, and robust interaction.
Figure 2: The Operation-to-Intent paradigm presents several limitations. (a) Users must memorize multiple gesture-task associations, raising learning costs. (b) Users need to coordinate gaze and hand movements, increasing complexity and mental burden. (c) Users must achieve precise gesture matching, and recognition accuracy issues increase interaction errors.
Figure 3: The pipeline of SIAgent. User eye-hand motions are first captured and translated into natural language descriptions through spatial-to-linguistic translation. The LLM then performs intent recognition to infer possible user intents for selection. Subsequently, the LLM generates executable parameters for agent-driven spatial interaction based on the confirmed intent.
Figure 4: Based on observations of finger shape patterns during hand-object interaction, five hand types are defined: (a) open, (b) half-grip, (c) tight-grip, (d) tip pinch, and (e) index tap.
Figure 5: We demonstrate intent recognition results for two tasks: (a) adjusting the lamp's lighting, and (b) seasoning the fish.
...and 4 more figures
