
Motion Capture is Not the Target Domain: Scaling Synthetic Data for Learning Motion Representations

Firas Darwish, George Nicholson, Aiden Doherty, Hang Yuan

TL;DR

The paper tackles the sim-to-real challenge in pretraining motion representations for wearable Human Activity Recognition (HAR), using synthetic data generated from motion-capture-derived trajectories conditioned on text prompts. It adopts a UniMTS-style dual-encoder framework to learn text–motion representations and evaluates transfer to 18 HAR datasets under $0$-shot and $k$-shot regimes. Synthetic data improves generalisation when mixed with real data or scaled sufficiently, but large-scale mocap pretraining yields only marginal gains because of domain mismatch with wearable signals. The work highlights the limitations of motion capture as a surrogate for HAR and points to the need for domain-aligned synthetic generation and richer prompting to realise the full potential of synthetic data for transferable HAR representations.

Abstract

Synthetic data offers a compelling path to scalable pretraining when real-world data is scarce, but models pretrained on synthetic data often fail to transfer reliably to deployment settings. We study this problem in full-body human motion, where large-scale data collection is infeasible but essential for wearable-based Human Activity Recognition (HAR), and where synthetic motion can be generated from motion-capture-derived representations. We pretrain motion time-series models using such synthetic data and evaluate their transfer across diverse downstream HAR tasks. Our results show that synthetic pretraining improves generalisation when mixed with real data or scaled sufficiently. We also demonstrate that large-scale motion-capture pretraining yields only marginal gains due to domain mismatch with wearable signals, clarifying key sim-to-real challenges and the limits and opportunities of synthetic motion data for transferable HAR representations.

Paper Structure (19 sections, 5 equations, 3 figures, 3 tables)


Figures (3)

  • Figure 1: Modelling Human Motion. Human motion modelling relies on selecting a number of joints to track over a period of time, forming a discretised representation of human motion. This motion can be labelled and annotated with text. Using inverse kinematics, one can simulate accelerometer and gyroscope readings for each of those joints over time that can be used for wearables pretraining.
  • Figure 2: Fine-tuning stabilises gains from real-world pretraining. While increasing real-world pretraining data yields small but consistent improvements in fine-tuned ($k$-shot) performance, zero-shot performance remains unstable and does not scale reliably with data volume. Curves show results from three pretrained models per data scale, using the same set of random seeds at each percentage; shaded regions indicate standard error. Scores are mean macro-averaged F1 across downstream datasets.
  • Figure 3: Using the Text-to-Motion Model. For visualisation, text and motion samples are assumed to be embedded and projected into a shared low-dimensional space; the black curve represents the text-to-motion model (shown here as a deterministic mapping). In-distribution synthetic data is obtained by sampling the text-to-motion model with text inputs that match those in our real-world data. Out-of-distribution synthetic data is obtained by sampling motion from text unseen in our real-world data.
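
The Figure 1 caption describes simulating accelerometer readings from tracked joint trajectories. The paper's own pipeline is not shown on this page, so the following is only a minimal sketch of one common way to approximate such readings. It assumes world-frame joint positions in metres, a fixed sample rate, and a z-up frame with gravity of 9.81 m/s^2; the function name `simulate_accelerometer` is hypothetical. It swaps in plain finite differences and does not rotate the result into the sensor's local frame, which a realistic wearable simulation would also need.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Assumption: world frame with z pointing up, units of m/s^2.
GRAVITY: Vec3 = (0.0, 0.0, -9.81)


def simulate_accelerometer(positions: Sequence[Vec3], sample_rate_hz: float) -> List[Vec3]:
    """Approximate accelerometer readings for one joint from its positions.

    Linear acceleration comes from a second-order central difference.
    Gravity is then subtracted to give specific force, which is what an
    accelerometer measures. The output stays in the world frame and has
    two fewer frames than the input (the endpoints are dropped).
    """
    if len(positions) < 3:
        raise ValueError("need at least three frames")
    dt = 1.0 / sample_rate_hz
    readings: List[Vec3] = []
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        # a[t] ~= (p[t+1] - 2 p[t] + p[t-1]) / dt^2, per axis
        accel = tuple((n - 2.0 * c + p) / (dt * dt) for p, c, n in zip(prev, cur, nxt))
        # specific force = acceleration - gravity
        readings.append(tuple(a - g for a, g in zip(accel, GRAVITY)))
    return readings


# Usage: a joint accelerating at 2 m/s^2 along x, sampled at 50 Hz.
rate = 50.0
track = [((i / rate) ** 2, 0.0, 0.0) for i in range(10)]
readings = simulate_accelerometer(track, rate)
```

In this toy trajectory the x component of every reading is approximately 2 m/s^2 and the z component carries the gravity offset, which is the kind of signal structure a model pretrained on simulated wearable data would see.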