Free3D: 3D Human Motion Emerges from Single-View 2D Supervision

Sheng Liu, Yuanzhi Liang, Sidan Du

TL;DR

Free3D tackles the generalization gap in 3D human motion generation by removing the dependence on ground-truth 3D annotations and learning from single-view 2D motion data. It introduces a Motion Lifting Residual Quantized VAE (ML-RQ-VAE) and a suite of 3D-free regularizations that enforce view consistency, orientation coherence, and feature-level alignment, enabling 3D motion to be lifted from 2D cues. In experiments on HumanML3D and KIT-ML, Free3D performs comparably to or better than 3D-supervised baselines and attains a high Quality–Multimodality (QM) score, a measure that jointly reflects FID (realism) and MultiModality (diversity). The work suggests that relaxing explicit 3D supervision can foster stronger structural reasoning and yield scalable, annotation-efficient 3D motion synthesis.

Abstract

Recent 3D human motion generation models demonstrate remarkable reconstruction accuracy yet struggle to generalize beyond training distributions. This limitation arises partly from the use of precise 3D supervision, which encourages models to fit fixed coordinate patterns instead of learning the essential 3D structure and motion semantic cues required for robust generalization. To overcome this limitation, we propose Free3D, a framework that synthesizes realistic 3D motions without any 3D motion annotations. Free3D introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion sequences into 3D-consistent latent spaces, and a suite of 3D-free regularization objectives enforcing view consistency, orientation coherence, and physical plausibility. Trained entirely on 2D motion data, Free3D generates diverse, temporally coherent, and semantically aligned 3D motions, achieving performance comparable to or even surpassing fully 3D-supervised counterparts. These results suggest that relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering a scalable and data-efficient paradigm for 3D motion generation.

Paper Structure

This paper contains 32 sections, 16 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: MultiModality-FID comparisons of different methods on the HumanML3D dataset. 3D-supervised methods, such as MDM (Tevet et al., 2023), tend to achieve low FID but relatively high MultiModality, whereas MoMask (Guo et al., 2024) exhibits the opposite trend. Free3D, trained solely with single-view 2D supervision, relaxes the constraints of precise 3D signals. This encourages the model to learn true motion structure and semantic patterns, enabling it to achieve both high MultiModality and low FID simultaneously. The point size of each method represents its Quality–Multimodality Score (QM), which jointly reflects FID and MultiModality through an integrated measure detailed in the experimental section.
  • Figure 2: The pipeline of our method. Free3D can generate consistent and physically plausible 3D motions without relying on any 3D annotations, using only single-view 2D motion supervision. The decoupled global trajectory and 2D motion sequences are first encoded into a latent space via the Motion Lifting Residual Quantized VAE (ML-RQ) and then decoded into 3D motion. A set of 3D-free regularization objectives (view-consistent projection losses, view-invariant priors through random rotations, orientation regularization, and feature-level consistency) is applied to enforce geometric stability, temporal coherence, and semantic alignment of the generated motions. Similar to MoMask (Guo et al., 2024), motion generation is performed in the quantized latent space produced by the encoder.
  • Figure 3: Qualitative comparison on the HumanML3D dataset. Compared with MoMask (Guo et al., 2024), our method demonstrates better generalization and a more accurate understanding of textual semantics, producing motions that more closely align with the given text descriptions.
  • Figure 4: Qualitative Results on In-the-Wild Motions. Free3D is capable of learning from and generating motions extracted from in-the-wild 2D videos. We selected several actions that cannot be physically performed or captured with 3D motion sensors under existing conditions, extracted their 2D motion sequences, and used them for supervision. Free3D successfully generates high-quality 3D motions for these challenging scenarios, despite the absence of 3D annotations.
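
The view-consistent projection loss and the random-rotation views mentioned in the Figure 2 caption can be illustrated with a toy sketch. This is not the paper's implementation: the orthographic camera, the y-up axis convention, the mean-squared-error form of the loss, and every function name below are assumptions chosen for illustration only.

```python
import math
import random

def rotate_y(joints, angle):
    """Rotate 3D joints (x, y, z) about the vertical y-axis by `angle` radians.
    Assumption: y is the up axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints]

def project_orthographic(joints):
    """Drop the depth coordinate (assumed orthographic camera along z)."""
    return [(x, y) for x, y, _ in joints]

def reprojection_loss(pred_3d, target_2d):
    """Mean squared 2D error between the projected 3D prediction
    and the single-view 2D supervision (hypothetical loss form)."""
    proj = project_orthographic(pred_3d)
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(proj, target_2d))
    return total / len(target_2d)

def random_view(pred_3d, rng):
    """Project the same pose from a randomly rotated viewpoint, giving a
    novel-view 2D pose that a view-invariant prior could then score."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return project_orthographic(rotate_y(pred_3d, angle))

# Toy pose with three joints; the observed 2D pose is its frontal projection.
pose_3d = [(0.0, 1.0, 0.0), (0.5, 0.5, 0.2), (-0.5, 0.5, -0.2)]
observed_2d = [(0.0, 1.0), (0.5, 0.5), (-0.5, 0.5)]

loss = reprojection_loss(pose_3d, observed_2d)
novel_2d = random_view(pose_3d, random.Random(0))
```

In this toy setup the frontal reprojection matches the observation exactly, while the depth values remain unconstrained by that single view; this ambiguity is what additional regularizers such as the random-rotation views are meant to address.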