Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs

Sinan Mutlu, Georgios F. Angelis, Savas Ozkan, Paul Wisbey, Anastasios Drosou, Mete Ozay

TL;DR

This work addresses real-time 3D full-body motion generation from sparse inputs for XR by introducing Mem-MLP, a lightweight MLP framework augmented with a Memory-Block that encodes missing sensor information using trainable code-vectors derived from a frozen VQ-VAE. It employs a two-branch multi-task predictor to jointly estimate joint rotations and global positions, trained with a set of losses whose angle and velocity terms are balanced via homoscedastic uncertainty. On AMASS-based benchmarks, Mem-MLP achieves state-of-the-art accuracy with a compact footprint (e.g., 0.25–0.38 GFLOPs) and real-time on-device performance (up to 72 FPS on Quest-3), significantly outperforming transformer and diffusion baselines on several metrics. The results demonstrate strong temporal coherence and practical viability for AR/VR applications with constrained sensing and compute.
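The summary above does not spell out how the loss terms are balanced. As a rough illustration only, the sketch below uses the widely used homoscedastic-uncertainty weighting, in which each task loss $L_i$ is paired with a learnable log-variance $s_i$ and the total is $\sum_i \left(e^{-s_i} L_i + s_i\right)$. The function name `uncertainty_weighted_loss`, the loss values, and the exact form of the weighting are assumptions made for this example, not details confirmed by the paper.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    s_i = log(sigma_i^2) would be a trainable parameter in practice;
    here it is a plain float so the arithmetic is easy to follow.
    This is an assumed, commonly used form, not the paper's stated loss.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per loss term")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# Hypothetical per-task losses, e.g. an angle term and a velocity term.
task_losses = [0.8, 0.2]

# With s_i = 0 every term has weight 1, so the total is the plain sum.
total_equal = uncertainty_weighted_loss(task_losses, [0.0, 0.0])

# Raising s_0 shrinks the weight on the first loss but adds s_0 as a penalty,
# which stops the optimizer from driving all weights to zero.
total_down = uncertainty_weighted_loss(task_losses, [1.0, 0.0])
```

In a training framework the values in `log_vars` would be optimized jointly with the network weights, so the balance between terms is learned rather than hand-tuned.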

Abstract

Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track the head and hands via Head Mounted Devices (HMDs) and controllers, which leaves the 3D full-body reconstruction incomplete. One potential approach is to generate full-body motion from sparse inputs collected from a limited set of sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs, which improves the trade-off between accuracy and running time.

Paper Structure

This paper contains 19 sections, 5 equations, 13 figures, 17 tables.

Figures (13)

  • Figure 1: On-device comparison of state-of-the-art methods for full-body motion generation. For each method, we report performance by plotting inference speed (FPS) against mean per-joint position error (MPJPE). All timings are obtained by running inference on the Quest-3 headset. Note that a lower MPJPE value indicates better accuracy, whereas a higher FPS value indicates better efficiency.
  • Figure 2: Given sparse inputs of body joint representations, our model can accurately generate diverse motions in real time on a head-mounted device.
  • Figure 3: The architecture of our method. The details of our method are explained in Sec. \ref{sec:Method}.
  • Figure 4: The architecture of our memory-block component for the $l^{\text{th}}$ layer during training, where $l=1,\dots,L/2$. Its details are explained in Sec. \ref{sec:backbone}.
  • Figure 5: Visualization of our results on the Quest-3 headset and the hand controllers. Our model is capable of generating diverse motions such as walking, sitting, and jumping in real time.
  • ...and 8 more figures
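
Figure 1 reports accuracy as mean per-joint position error (MPJPE). As a reminder of what that metric measures, here is a minimal sketch: the Euclidean distance between each predicted joint and its reference position, averaged over all joints and frames. The helper name `mpjpe`, the nested-sequence input layout, and the toy data are assumptions for illustration; the paper's exact evaluation protocol (units, joint set, any alignment step) is not given in this excerpt.

```python
import math

def mpjpe(pred, target):
    """Mean Euclidean distance between predicted and reference joints.

    pred, target: sequences of frames, where each frame is a sequence of
    (x, y, z) joint positions. The result is in the units of the inputs.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    distances = []
    for frame_pred, frame_target in zip(pred, target):
        if len(frame_pred) != len(frame_target):
            raise ValueError("frames must have the same number of joints")
        for joint_pred, joint_target in zip(frame_pred, frame_target):
            distances.append(math.dist(joint_pred, joint_target))
    if not distances:
        raise ValueError("no joints to evaluate")
    return sum(distances) / len(distances)

# Toy data: 2 frames with 2 joints each; every prediction is offset by
# (3, 4, 0) from its reference, so each per-joint error is exactly 5.
reference = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)] for _ in range(2)]
predicted = [[(x + 3.0, y + 4.0, z) for (x, y, z) in frame] for frame in reference]

error = mpjpe(predicted, reference)
```

Because the metric averages raw distances, a single badly placed joint raises the score in proportion to its error, which is why lower values in Figure 1 indicate better accuracy.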