Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs
Sinan Mutlu, Georgios F. Angelis, Savas Ozkan, Paul Wisbey, Anastasios Drosou, Mete Ozay
TL;DR
This work addresses real-time 3D full-body motion generation from sparse inputs for XR by introducing Mem-MLP, a lightweight MLP framework augmented with a Memory-Block that encodes missing sensor information using trainable code-vectors derived from a frozen VQ-VAE. It employs a two-branch multi-task predictor to jointly estimate joint rotations and global positions, optimized with a set of losses whose angle and velocity terms are balanced via homoscedastic uncertainty. On AMASS-based benchmarks, Mem-MLP achieves state-of-the-art accuracy with a compact footprint (e.g., 0.25–0.38 GFLOPs) and real-time on-device performance (up to 72 FPS on Quest-3), significantly outperforming transformer/diffusion baselines in several metrics. The results demonstrate strong temporal coherence and practical viability for AR/VR applications with constrained sensing and compute.
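To make the idea of balancing loss terms via homoscedastic uncertainty concrete, the sketch below uses one widely used formulation in which each task loss L_i is scaled by exp(-s_i) and regularized by s_i, where s_i is a learnable log-variance. This summary does not give the exact expression used by Mem-MLP, so the formula, the function name `uncertainty_weighted_loss`, and the numeric values are assumptions for illustration only.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses with learnable log-variances.

    Assumed formulation (not confirmed by the source):
        total = sum_i exp(-s_i) * L_i + s_i,  with s_i = log(sigma_i^2).
    A large s_i down-weights task i, while the additive s_i term
    penalizes the model for inflating every variance.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("one log-variance per task loss is required")
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += math.exp(-s) * loss + s
    return total


# Hypothetical angle and velocity loss values, chosen for illustration.
angle_loss = 0.8
velocity_loss = 0.2

# With all log-variances at zero, the result is the plain sum of the losses.
balanced = uncertainty_weighted_loss([angle_loss, velocity_loss], [0.0, 0.0])

# Raising the angle term's log-variance reduces its weight to 1/2.
reweighted = uncertainty_weighted_loss(
    [angle_loss, velocity_loss], [math.log(2.0), 0.0]
)
```

In a training loop the log-variances would be optimized jointly with the network weights, so the relative weighting of the terms is learned rather than hand-tuned.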
Abstract
Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track the head and hands via Head-Mounted Devices (HMDs) and controllers, which leaves the 3D full-body reconstruction incomplete. One potential approach is to generate full-body motion from the sparse inputs collected by these limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it runs at 72 FPS on mobile HMDs, which improves the trade-off between accuracy and running time.
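The two architectural ideas named in the abstract, a residual MLP layer and code-vectors standing in for unobserved sensors, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' architecture: the abstract does not say how the code-vectors are combined with the sparse history, so the concatenation below, the helper names (`memory_block_input`, `residual_block`), and all shapes and values are hypothetical.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Dense layer: weights is a list of rows, each as long as v."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def residual_block(weights, bias, v):
    """y = v + ReLU(W v + b); W must be square so the shapes match."""
    hidden = relu(linear(weights, bias, v))
    return [a + h for a, h in zip(v, hidden)]


def memory_block_input(sparse_history, code_vectors):
    """Assumed combination rule: flatten the sparse features of previous
    frames, then append the code-vectors that stand in for sensors the
    device does not have. In a real model the code-vectors would be
    trainable parameters; here they are fixed toy values."""
    flat = [x for frame in sparse_history for x in frame]
    for code in code_vectors:
        flat.extend(code)
    return flat


# Two previous frames with two sparse features each (toy values).
sparse_history = [[0.1, -0.2], [0.3, 0.4]]
# One hypothetical code-vector for a single untracked sensor.
code_vectors = [[1.0, 0.0]]

features = memory_block_input(sparse_history, code_vectors)

# Identity weights keep the arithmetic easy to follow by hand.
size = len(features)
identity = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
output = residual_block(identity, [0.0] * size, features)
```

The skip connection passes negative inputs through unchanged while positive ones are reinforced, which is the usual reason residual layers are easier to optimize than plain stacked layers.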
