Motion Capture is Not the Target Domain: Scaling Synthetic Data for Learning Motion Representations
Firas Darwish, George Nicholson, Aiden Doherty, Hang Yuan
TL;DR
The paper tackles the sim-to-real challenge in pretraining motion representations for wearable Human Activity Recognition (HAR), using synthetic data generated from motion-capture-derived trajectories conditioned on text prompts. It adopts a UniMTS-style dual-encoder framework to learn text–motion representations and evaluates transfer to 18 HAR datasets under $0$-shot and $k$-shot regimes. The findings show that synthetic data helps generalisation when mixed with real data or scaled sufficiently, but that large-scale motion-capture pretraining yields only marginal gains because of domain mismatch with wearable signals. The work highlights the limitations of motion capture as a surrogate for HAR and points to the need for domain-aligned synthetic generation and richer prompting to realise the full potential of synthetic data for transferable HAR representations.
Abstract
Synthetic data offers a compelling path to scalable pretraining when real-world data is scarce, but models pretrained on synthetic data often fail to transfer reliably to deployment settings. We study this problem in full-body human motion, where large-scale data is essential for wearable-based Human Activity Recognition (HAR) yet infeasible to collect, and where synthetic motion can be generated from motion-capture-derived representations. We pretrain motion time-series models using such synthetic data and evaluate their transfer across diverse downstream HAR tasks. Our results show that synthetic pretraining improves generalisation when mixed with real data or scaled sufficiently. We also demonstrate that large-scale motion-capture pretraining yields only marginal gains due to domain mismatch with wearable signals, clarifying key sim-to-real challenges as well as the limits and opportunities of synthetic motion data for transferable HAR representations.
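To make the dual-encoder setup mentioned in the TL;DR more concrete, the following is a minimal sketch of how text and motion embeddings could be aligned and then used for $0$-shot classification. It assumes a CLIP-style symmetric contrastive objective, which may differ from the exact loss used in the paper. The function names, the temperature value of 0.07, and the use of precomputed embeddings in place of real encoders are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(motion_emb, text_emb, temperature=0.07):
    # Hypothetical objective: row i of motion_emb is paired with row i of
    # text_emb; all other rows in the batch act as negatives.
    m = l2_normalize(motion_emb)
    t = l2_normalize(text_emb)
    logits = m @ t.T / temperature
    idx = np.arange(len(m))
    loss_motion_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_motion = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_motion_to_text + loss_text_to_motion)


def zero_shot_predict(motion_emb, label_text_emb):
    # Assign each motion window to the activity label whose text embedding
    # has the highest cosine similarity.
    sims = l2_normalize(motion_emb) @ l2_normalize(label_text_emb).T
    return sims.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for encoder outputs: 4 label prompts, 8-dimensional embeddings.
    text = rng.normal(size=(4, 8))
    motion = text + 0.01 * rng.normal(size=(4, 8))
    print("contrastive loss:", symmetric_contrastive_loss(motion, text))
    print("zero-shot labels:", zero_shot_predict(motion, text))
```

In this sketch, the $0$-shot regime needs no labelled downstream examples: only the text embeddings of the candidate activity names are required at inference time. A $k$-shot evaluation would instead fit or fine-tune a classifier on $k$ labelled examples per class.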
