Gaze-Guided 3D Hand Motion Prediction for Detecting Intent in Egocentric Grasping Tasks

Yufei He, Xucong Zhang, Arno H. A. Stienen

TL;DR

The paper tackles intention detection in egocentric grasping by predicting future hand motions from history, gaze, and object context. It introduces a two-component framework combining a Hand Motion VQ-VAE for discrete pose encoding with a decoder-only Transformer-based Hand Motion Generator that autoregressively predicts hand-motion sequences conditioned on gaze and object cues. Evaluations on a dataset of 15 subjects demonstrate that incorporating gaze significantly improves early predictions and generalization across subjects and motions. The approach shows promise for real-time, gaze-guided assistance in neurorehabilitation robotics by providing robust, proactive hand-motion predictions.

Abstract

Human intention detection through hand motion prediction is critical for driving upper-extremity assistive robots in neurorehabilitation applications. However, traditional methods that rely on physiological signal measurement are restrictive and often lack environmental context. We propose a novel approach that predicts future sequences of both hand poses and joint positions. The method integrates gaze information, historical hand motion sequences, and environmental object data, adapting dynamically to the assistive needs of the patient without prior knowledge of the object the patient intends to grasp. Specifically, we combine a vector-quantized variational autoencoder for robust hand pose encoding with an autoregressive generative transformer for hand motion sequence prediction. We demonstrate the usability of these techniques in a pilot study with healthy subjects. To train and evaluate the proposed method, we collect a dataset of various types of grasp actions on different objects from multiple subjects. Through extensive experiments, we demonstrate that the proposed method can successfully predict sequential hand movement. In particular, gaze information significantly improves prediction, especially with fewer input frames, highlighting the potential of the proposed method for real-world applications.
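The two ideas named in the abstract, discrete pose encoding by vector quantization and autoregressive prediction over the resulting indices, can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D codebook, the latent values, and the `next_index` predictor are hypothetical stand-ins, and a trained transformer conditioned on gaze and object features would replace the toy predictor.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def quantize(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `latent` (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(latent, codebook[i]))


def generate(
    history: List[int],
    next_index: Callable[[List[int]], int],
    steps: int,
) -> List[int]:
    """Greedy autoregressive rollout: each predicted index is appended to the
    context before the next one is predicted."""
    context = list(history)
    for _ in range(steps):
        context.append(next_index(context))
    return context[len(history):]


# Toy 2-D codebook with three entries (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical encoder outputs for three observed hand-motion frames.
latents = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]
indices = [quantize(z, codebook) for z in latents]

# Stand-in predictor (assumption): cycles through the codebook.
# A trained model would map the context to the next index instead.
future = generate(indices, lambda ctx: (ctx[-1] + 1) % len(codebook), steps=4)

# Decoding here is a plain codebook lookup; a VQ-VAE decoder would map
# the looked-up vectors back to hand poses.
decoded = [codebook[i] for i in future]
```

The rollout feeds every predicted index back into the context, which is what makes the prediction autoregressive; the quantization step is what turns continuous pose features into the discrete tokens such a model consumes.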

Paper Structure

This paper contains 18 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of gaze-guided human intention detection. Left: The user wears eye-tracking glasses to capture gaze fixation points (purple ray), initial hand motion (purple arrow), and object locations as input. Right: Using egocentric-view data, including eye fixation points (purple dot) and hand motion (blue hand skeletons), the system predicts a sequence of hand motions leading to the final grasping action (green arrow).
  • Figure 2: Overview of our framework for hand motion prediction. The proposed method consists of two main components: (a) Hand-Motion VQ-VAE, which encodes hand motion into codebook $C$ with indices $S$; and (b) Hand Motion Generator, which contains feature fusion layers and a transformer. In the feature fusion layers, the encoded hand motion $S$ is integrated with the eye-gaze features $G$ and object features $O$ to form the fused feature $X$. The transformer then predicts future hand motion indices autoregressively. These indices are subsequently decoded by the VQ-VAE decoder to obtain the predicted hand motions.
  • Figure 3: Data processing pipeline. This figure illustrates the steps applied to process egocentric video data for analysis: (a) Raw 2D images are captured from an egocentric-view video. (b) Throughout the entire sequence, the MediaPipe framework and Aria MPS are used to extract 3D hand motion, while Aria MPS extracts 3D gaze points. (c) The object representation is manually annotated on the first frame of the video. (d) A world coordinate system is used to integrate the hand-gaze sequence and the object representation into a unified 3D world frame.
  • Figure 4: Position Errors (in $m$) across Various Input Frames and Time (in $s$). This figure displays the end-pose (first row) and average (second row) position errors within the CS, CM, and CSM groups across different numbers of input frames and time before contact. Red lines represent results with gaze, and green lines represent results without gaze. Gray dashed lines at the bottom represent the position error calculated directly through the encoder and decoder of the hand-pose VQ-VAE.
  • Figure 5: An example of sequential hand position predictions, seen from the top view. The red and blue dots indicate the predicted right- and left-hand positions, respectively, linked by arrows to form the prediction sequences. The stars mark the final targets for both hands, each surrounded by a dashed circle denoting the 'target zone'. Axes are in meters.
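
Figure 4 reports end-pose and average position errors in meters. The captions do not give the exact formula, so the sketch below assumes one plausible reading: the per-frame error is the mean Euclidean distance between corresponding predicted and ground-truth 3D joints, the end-pose error is that value at the last frame, and the average error is its mean over all frames. The function names and sample coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Frame = Sequence[Point]  # joint positions of one hand pose, in meters


def frame_error(pred: Frame, target: Frame) -> float:
    """Mean Euclidean distance between corresponding joints of one frame."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def position_errors(
    pred: Sequence[Frame], target: Sequence[Frame]
) -> Tuple[float, float]:
    """Return (end-pose error, average error over the whole sequence)."""
    per_frame = [frame_error(p, t) for p, t in zip(pred, target)]
    return per_frame[-1], sum(per_frame) / len(per_frame)


# Two frames with two joints each (hypothetical values, in meters).
target = [
    [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)],
    [(0.0, 0.0, 0.1), (0.1, 0.0, 0.1)],
]
pred = [
    [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)],    # first frame matches exactly
    [(0.0, 0.03, 0.14), (0.1, 0.0, 0.1)],  # one joint is off by 5 cm
]
end_err, avg_err = position_errors(pred, target)
```

Under this assumed definition, an end-pose error larger than the average error, as in the toy data, indicates that the prediction drifts from the target toward the end of the sequence.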