Free3D: 3D Human Motion Emerges from Single-View 2D Supervision
Sheng Liu, Yuanzhi Liang, Sidan Du
TL;DR
Free3D tackles the generalization gap in 3D human motion generation by removing the dependence on ground-truth 3D annotations and learning from single-view 2D motion data alone. It introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) and a suite of 3D-free regularizations that enforce view consistency, orientation coherence, and feature-level alignment, so that 3D motion can be lifted from 2D cues. In experiments on HumanML3D and KIT-ML, Free3D matches or exceeds 3D-supervised baselines and attains a high QM score, a metric that combines realism and diversity. The work suggests that relaxing explicit 3D supervision can foster stronger structural reasoning and yield scalable, annotation-efficient 3D motion synthesis.
Abstract
Recent 3D human motion generation models demonstrate remarkable reconstruction accuracy yet struggle to generalize beyond training distributions. This limitation arises partly from the use of precise 3D supervision, which encourages models to fit fixed coordinate patterns instead of learning the essential 3D structure and motion semantic cues required for robust generalization. To overcome this limitation, we propose Free3D, a framework that synthesizes realistic 3D motions without any 3D motion annotations. Free3D introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion sequences into 3D-consistent latent spaces, and a suite of 3D-free regularization objectives enforcing view consistency, orientation coherence, and physical plausibility. Trained entirely on 2D motion data, Free3D generates diverse, temporally coherent, and semantically aligned 3D motions, achieving performance comparable to or even surpassing fully 3D-supervised counterparts. These results suggest that relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering a scalable and data-efficient paradigm for 3D motion generation.
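
For readers unfamiliar with the "residual quantized" part of ML-RQ, the following is a minimal sketch of generic residual vector quantization as commonly used in RQ-VAE-style models. It is not the Free3D implementation: the function name `residual_quantize`, the codebook sizes, and the random data are all hypothetical, and the encoder, decoder, and motion-lifting components are omitted entirely. The idea shown is only that each codebook level quantizes whatever the previous levels failed to capture.

```python
import numpy as np


def residual_quantize(z, codebooks):
    """Quantize latent vectors with a stack of codebooks (illustrative only).

    z:         array of shape (N, D), latent vectors to quantize.
    codebooks: list of arrays, each of shape (K, D); one per residual level.
    Returns (quantized, indices) with shapes (N, D) and (N, num_levels).
    """
    residual = z.astype(np.float64)          # astype returns a copy, so z is untouched
    quantized = np.zeros_like(residual)
    indices = []
    for codebook in codebooks:
        # Squared Euclidean distance from every residual to every code: (N, K).
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)           # nearest code per vector
        chosen = codebook[idx]
        quantized += chosen                  # accumulate the approximation
        residual -= chosen                   # pass what is left to the next level
        indices.append(idx)
    return quantized, np.stack(indices, axis=1)


# Hypothetical usage: 4 latent vectors of dimension 8, 3 levels of 16 codes each.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
codebooks = [rng.normal(size=(16, 8)) * (0.5 ** level) for level in range(3)]
z_hat, codes = residual_quantize(z, codebooks)
print(codes.shape)  # (4, 3): one code index per vector per level
```

In a trained model the codebooks would be learned jointly with the encoder and decoder rather than sampled at random; the random codebooks here only serve to make the sketch runnable.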
