MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation

Nhat M. Hoang, Kehong Gong, Chuan Guo, Michael Bi Mi

TL;DR

MotionMix addresses controllable 3D human motion generation when high-quality annotated data is scarce. It introduces a weakly-supervised diffusion framework that learns from both noisy annotated and unannotated motion sequences. The diffusion process is partitioned at a denoising pivot $T^*$: the early denoising steps produce rough motion approximations under the condition, and the later steps refine them unconditionally, which allows robust generation from imperfect data. Across text-to-motion, action-to-motion, and music-to-dance tasks, MotionMix with multiple backbones achieves state-of-the-art or competitive results, and ablations demonstrate strong data efficiency. The approach makes motion synthesis more scalable by exploiting abundant real-world motion resources while maintaining generation quality and controllability.

Abstract

Controllable generation of 3D human motions has become an important topic as the world embraces digital transformation. Existing works, though making promising progress with the advent of diffusion models, rely heavily on meticulously captured and annotated (e.g., text) high-quality motion corpora, a resource-intensive endeavor in the real world. This motivates our proposed MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences. Specifically, we separate the denoising objectives of a diffusion model into two stages: obtaining conditional rough motion approximations in the initial $T-T^*$ steps by learning from the noisy annotated motions, followed by the unconditional refinement of these preliminary motions during the last $T^*$ steps using unannotated motions. Notably, though learning from two sources of imperfect data, our model does not compromise motion generation quality compared to fully supervised approaches that access gold data. Extensive experiments on several benchmarks demonstrate that MotionMix, as a versatile framework, consistently achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks. Project page: https://nhathoang2002.github.io/MotionMix-page/

Paper Structure (22 sections, 3 equations, 3 figures, 7 tables)


Figures (3)

  • Figure 1: Examples of applying MotionMix to text-to-motion generation. Unlike previous works, our training data consist only of noisy annotated motions and unannotated motions. https://nhathoang2002.github.io/MotionMix-page/
  • Figure 2: (Left) Training process. The model is trained on a mixture of noisy and clean data. For each clean sample, a noise timestep is drawn from $[1, T^*]$; for each noisy sample, it is drawn from $[T^*+1, T]$. Here, $T^*$ is the denoising pivot: the timestep from which the diffusion model refines noisy motion sequences into clean ones without any guidance. (Right) Sampling process. Sampling consists of two stages. In Stage-1, from timestep $T$ to $T^*+1$, the model generates rough motion approximations guided by the conditional input $c$. In Stage-2, from timestep $T^*$ to $1$, the model refines these approximations into high-quality motion sequences while the input $c$ is masked.
  • Figure 3: Qualitative comparison of the baseline MDM and MotionDiffuse models, trained exclusively on high-quality annotated data, with our MotionMix approach, which learns from imperfect data sources. The generated motions are shown alongside real references for three distinct text prompts. Please refer to the supplementary files for more animations.
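
The timestep partition described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of uniform integer sampling, and the values of $T$ and $T^*$ in the usage note are assumptions made for illustration. It only shows which timestep range each kind of training sample draws from, and at which sampling steps the condition $c$ would be used or masked.

```python
import random


def sample_training_timestep(is_clean: bool, T: int, T_star: int,
                             rng: random.Random) -> int:
    """Draw a noise timestep for one training sample (hypothetical helper).

    Clean samples draw from [1, T_star]; noisy samples draw from
    [T_star + 1, T]. Uniform sampling is an assumption of this sketch.
    """
    if not 1 <= T_star < T:
        raise ValueError("expected 1 <= T_star < T")
    if is_clean:
        return rng.randint(1, T_star)       # both ends inclusive
    return rng.randint(T_star + 1, T)       # both ends inclusive


def sampling_schedule(T: int, T_star: int) -> list[tuple[int, bool]]:
    """List (timestep, use_condition) pairs from T down to 1.

    Stage-1 (T .. T_star + 1) uses the condition c;
    Stage-2 (T_star .. 1) masks it.
    """
    return [(t, t > T_star) for t in range(T, 0, -1)]
```

For example, with the hypothetical values T = 5 and T_star = 2, sampling_schedule(5, 2) marks timesteps 5, 4, and 3 as conditional and timesteps 2 and 1 as unconditional.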