HaHeAE: Learning Generalisable Joint Representations of Human Hand and Head Movements in Extended Reality

Zhiming Hu, Guanhua Zhang, Zheming Yin, Daniel Haeufle, Syn Schmitt, Andreas Bulling

TL;DR

HaHeAE addresses the lack of generalisable joint models of human hand and head movements in XR with a self-supervised autoencoder that combines a GCN-based semantic encoder with a DDIM-based stochastic encoder and a DDIM-based decoder. Hand-head forecasting serves as an auxiliary training task to refine the semantic representation. Evaluated on the EgoBody, ADT, and GIMO datasets, the method improves reconstruction quality over commonly used self-supervised baselines by up to 74.0% and generalises across users, activities, and XR environments. The learned semantic representation supports interpretable hand-head cluster identification, altering the stochastic representation yields variable hand-head movements, and the model can serve as a feature extractor for downstream tasks such as user identification and activity recognition. These results point to the potential of self-supervised joint hand-head modelling for XR systems.

Abstract

Human hand and head movements are the most pervasive input modalities in extended reality (XR) and are significant for a wide range of applications. However, prior works on hand and head modelling in XR only explored a single modality or focused on specific applications. We present HaHeAE, a novel self-supervised method for learning generalisable joint representations of hand and head movements in XR. At the core of our method is an autoencoder (AE) that uses a graph convolutional network-based semantic encoder and a diffusion-based stochastic encoder to learn the joint semantic and stochastic representations of hand-head movements. It also features a diffusion-based decoder to reconstruct the original signals. Through extensive evaluations on three public XR datasets, we show that our method 1) significantly outperforms commonly used self-supervised methods by up to 74.0% in terms of reconstruction quality and is generalisable across users, activities, and XR environments, 2) enables new applications, including interpretable hand-head cluster identification and variable hand-head movement generation, and 3) can serve as an effective feature extractor for downstream tasks. Together, these results demonstrate the effectiveness of our method and underline the potential of self-supervised methods for jointly modelling hand-head behaviours in extended reality.
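
The interplay between the diffusion-based stochastic encoder and the diffusion-based decoder can be illustrated with a small round trip. The sketch below is not the paper's implementation: it uses the standard deterministic DDIM update (eta = 0), an assumed cosine-style noise schedule, and a hypothetical stand-in `toy_eps` in place of the trained noise predictor. The stand-in depends only on the semantic code `z_sem`, which makes the inversion exact; a trained network also sees the noisy signal and the time step, and the round trip is then only approximate. All function names, the 4-value signal, and the 2-value semantic code are illustrative.

```python
import math

def alpha_bar(t, num_steps):
    """Cumulative signal level at step t (assumed cosine-style schedule, 1.0 at t = 0)."""
    return math.cos(0.5 * math.pi * t / (num_steps + 1)) ** 2

def toy_eps(z_sem, dim):
    """Hypothetical stand-in for the trained noise predictor eps(x_t, t, z_sem).

    It ignores x_t and t on purpose, which keeps the round trip exact.
    """
    s = sum(z_sem)
    return [math.tanh(s + 0.3 * i) for i in range(dim)]

def ddim_move(x, t_from, t_to, z_sem, num_steps):
    """One deterministic DDIM update (eta = 0) from step t_from to step t_to."""
    a_from = alpha_bar(t_from, num_steps)
    a_to = alpha_bar(t_to, num_steps)
    eps = toy_eps(z_sem, len(x))
    # Predict the clean signal, then re-noise it to the target step.
    x0 = [(xi - math.sqrt(1.0 - a_from) * e) / math.sqrt(a_from)
          for xi, e in zip(x, eps)]
    return [math.sqrt(a_to) * x0i + math.sqrt(1.0 - a_to) * e
            for x0i, e in zip(x0, eps)]

def stochastic_encode(x0, z_sem, num_steps=50):
    """Run the DDIM update forwards (t = 0 .. T) to obtain the stochastic code x_T."""
    x = list(x0)
    for t in range(num_steps):
        x = ddim_move(x, t, t + 1, z_sem, num_steps)
    return x

def decode(x_T, z_sem, num_steps=50):
    """Run the DDIM update backwards (t = T .. 0) to reconstruct the signal."""
    x = list(x_T)
    for t in range(num_steps, 0, -1):
        x = ddim_move(x, t, t - 1, z_sem, num_steps)
    return x

signal = [0.2, -0.5, 1.0, 0.1]          # illustrative flattened hand-head sample
z_sem = [0.4, -0.1]                     # illustrative semantic representation
x_T = stochastic_encode(signal, z_sem)  # stochastic representation
recon = decode(x_T, z_sem)              # same semantics: recovers the signal
recon_other = decode(x_T, [1.5, 0.7])   # different semantics: a different signal
```

Decoding `x_T` with the semantic code it was encoded under returns the original signal, while pairing the same `x_T` with a different semantic code produces a different output. This is the sense in which the two representations jointly determine the reconstruction.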

Paper Structure

This paper contains 52 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Architecture of HaHeAE. Our method uses a GCN-based semantic encoder to learn the joint semantic representation of hand-head movements and a DDIM-based stochastic encoder to encode the remaining stochastic variations. A DDIM-based hand-head decoder reconstructs the original input signals from these semantic and stochastic representations. Hand-head forecasting is used as an auxiliary training task to refine the semantic representation.
  • Figure 2: The ResConv1D module used in our method is conditioned on the time step embedding $E_t$ and the semantic representation $E_{sem}$.
  • Figure 3: Representative hand and head movements for each of the four largest clusters and their semantics on the EgoBody dataset. The red and blue lines indicate the trajectories of left and right hands, respectively, while the black arrows denote head orientations. The colours of the lines and arrows are gradually deepened over time.
  • Figure 4: Hand and head movements generated from altered stochastic representations and random noise on the EgoBody dataset. Our method can generate variable and realistic hand-head movements from altered stochastic representations and random noise.
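
The Figure 2 caption states that the ResConv1D module is conditioned on $E_t$ and $E_{sem}$ but not how. A minimal NumPy sketch of one common scheme, scale-and-shift modulation inside a residual 1D convolution block, is given below. The choice of modulation (scale from $E_{sem}$, shift from $E_t$), the layer order, the random weights, and every name and size are assumptions for illustration, not the authors' design.

```python
import numpy as np

def conv1d_same(x, w, b):
    """Zero-padded 'same' 1D convolution (cross-correlation).

    x: (C_in, L), w: (C_out, C_in, K) with odd K, b: (C_out,).
    """
    k = w.shape[2]
    pad = k // 2
    length = x.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Stack the K shifted views of the padded input: (K, C_in, L).
    windows = np.stack([xp[:, i:i + length] for i in range(k)])
    return np.einsum("oik,kil->ol", w, windows) + b[:, None]

def silu(x):
    return x / (1.0 + np.exp(-x))

class CondResConv1D:
    """Hypothetical residual conv block modulated by E_t and E_sem."""

    def __init__(self, channels, emb_dim, kernel=3, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (channels, channels, kernel))
        self.b1 = np.zeros(channels)
        self.w2 = rng.normal(0.0, 0.1, (channels, channels, kernel))
        self.b2 = np.zeros(channels)
        # Linear maps from each embedding to one value per channel.
        self.w_t = rng.normal(0.0, 0.1, (channels, emb_dim))    # E_t -> shift
        self.w_sem = rng.normal(0.0, 0.1, (channels, emb_dim))  # E_sem -> scale

    def __call__(self, x, e_t, e_sem):
        h = conv1d_same(silu(x), self.w1, self.b1)
        shift = self.w_t @ e_t
        scale = self.w_sem @ e_sem
        # Assumed conditioning: per-channel scale and shift of the features.
        h = h * (1.0 + scale[:, None]) + shift[:, None]
        h = conv1d_same(silu(h), self.w2, self.b2)
        return x + h  # residual connection

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 16))   # illustrative: 4 channels, 16 time steps
e_t = rng.normal(size=8)       # illustrative time step embedding E_t
e_sem = rng.normal(size=8)     # illustrative semantic representation E_sem

block = CondResConv1D(channels=4, emb_dim=8)
y = block(x, e_t, e_sem)
y_sem = block(x, e_t, rng.normal(size=8))  # same input, different E_sem
y_t = block(x, rng.normal(size=8), e_sem)  # same input, different E_t
```

The output keeps the input's shape, so such blocks can be stacked, and changing either embedding changes the output for the same input signal.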