HaHeAE: Learning Generalisable Joint Representations of Human Hand and Head Movements in Extended Reality
Zhiming Hu, Guanhua Zhang, Zheming Yin, Daniel Haeufle, Syn Schmitt, Andreas Bulling
TL;DR
HaHeAE addresses the lack of generalisable joint models of human hand and head movements in XR with a self-supervised autoencoder that combines a GCN-based semantic encoder with a DDIM-based stochastic encoder and decoder. An auxiliary hand-head forecasting task refines the semantic features. Evaluated on the EgoBody, ADT, and GIMO datasets, the method improves reconstruction quality by up to 74.0% over commonly used self-supervised baselines and generalises across users, activities, and XR environments. The learned representations enable interpretable hand-head cluster identification and variable movement generation, and can serve as a feature extractor for downstream tasks such as user identification and activity recognition. These results underline the potential of self-supervised joint hand-head modelling for XR systems.
Abstract
Human hand and head movements are the most pervasive input modalities in extended reality (XR) and are important for a wide range of applications. However, prior work on hand and head modelling in XR has explored only a single modality or focused on specific applications. We present HaHeAE, a novel self-supervised method for learning generalisable joint representations of hand and head movements in XR. At the core of our method is an autoencoder (AE) that uses a graph convolutional network-based semantic encoder and a diffusion-based stochastic encoder to learn joint semantic and stochastic representations of hand-head movements. It also features a diffusion-based decoder to reconstruct the original signals. Through extensive evaluations on three public XR datasets, we show that our method 1) significantly outperforms commonly used self-supervised methods by up to 74.0% in terms of reconstruction quality and is generalisable across users, activities, and XR environments, 2) enables new applications, including interpretable hand-head cluster identification and variable hand-head movement generation, and 3) can serve as an effective feature extractor for downstream tasks. Together, these results demonstrate the effectiveness of our method and underline the potential of self-supervised methods for jointly modelling hand-head behaviours in extended reality.
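The abstract does not detail how the diffusion-based stochastic encoder and decoder operate; the TL;DR names DDIM. As a rough illustration only, the sketch below shows the general idea of deterministic DDIM encoding and decoding conditioned on a semantic code: the signal is pushed step by step towards noise to obtain a stochastic code, and the same update run in reverse recovers the signal. Everything here is an assumption for illustration and not the authors' implementation: the noise schedule `alpha_bar`, the step count `T`, and in particular `toy_eps`, a hypothetical stand-in for the trained noise-prediction network that depends only on the semantic code `z_sem`.

```python
import math

def alpha_bar(t, T):
    """Illustrative cosine-style schedule (assumption): 1.0 at t=0, small at t=T."""
    return math.cos(0.5 * math.pi * t / (T + 1)) ** 2

def toy_eps(x, t, z_sem):
    """Hypothetical stand-in for a trained noise predictor.
    A real model would depend on x, t and z_sem; this toy uses z_sem only."""
    return [math.tanh(z_sem[i % len(z_sem)]) for i in range(len(x))]

def ddim_move(x, t_from, t_to, z_sem, T):
    """One deterministic DDIM update from timestep t_from to t_to."""
    a_from, a_to = alpha_bar(t_from, T), alpha_bar(t_to, T)
    eps = toy_eps(x, t_from, z_sem)
    # Estimate the clean signal implied by the current sample and predicted noise.
    x0_hat = [(xi - math.sqrt(1 - a_from) * ei) / math.sqrt(a_from)
              for xi, ei in zip(x, eps)]
    # Re-noise that estimate to the target noise level.
    return [math.sqrt(a_to) * x0i + math.sqrt(1 - a_to) * ei
            for x0i, ei in zip(x0_hat, eps)]

def stochastic_encode(x0, z_sem, T=50):
    """Run the update forwards (t = 0 .. T) to obtain the stochastic code x_T."""
    x = list(x0)
    for t in range(T):
        x = ddim_move(x, t, t + 1, z_sem, T)
    return x

def decode(x_T, z_sem, T=50):
    """Run the update backwards (t = T .. 0) to reconstruct the signal."""
    x = list(x_T)
    for t in range(T, 0, -1):
        x = ddim_move(x, t, t - 1, z_sem, T)
    return x

if __name__ == "__main__":
    signal = [0.3, -1.2, 0.8, 0.05]   # made-up flattened hand-head features
    z_sem = [0.5, -0.7]               # made-up semantic code
    x_T = stochastic_encode(signal, z_sem)
    recon = decode(x_T, z_sem)
    print(max(abs(a - b) for a, b in zip(signal, recon)))
```

Because the toy predictor ignores `x` and `t`, the round trip is exact up to floating-point error; with a learned predictor, DDIM inversion is only approximately invertible and the reconstruction quality depends on the model and the number of steps.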
