OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers

Han Liang, Jiacheng Bao, Ruichi Zhang, Sihan Ren, Yuecheng Xu, Sibei Yang, Xin Chen, Jingyi Yu, Lan Xu

TL;DR

OMG tackles open-vocabulary motion generation from unseen text prompts with a two-stage framework. It pretrains a large unconditional diffusion model on over 20 million motion frames to learn a rich motion manifold, then finetunes a conditional denoiser with text prompts using Motion ControlNet and a Mixture-of-Controllers (MoC) to align text tokens with sub-motion ranges via cross-attention and token-specific experts. The Mixture-of-Controllers introduces a text-token-conditioned residual mechanism, guided by CLIP embeddings, to achieve precise, zero-shot text-to-motion alignment. Experiments on HumanML3D and Mixamo show state-of-the-art zero-shot performance in FID, CLIP-score, and text-motion alignment, with ablations confirming the contributions of large-scale pretraining, model scaling, and the MoC design. This work enables scalable open-vocabulary motion generation with broad implications for animation, robotics, and VR/AR applications.

Abstract

We have recently seen tremendous progress in realistic text-to-motion generation. Yet, the existing methods often fail or produce implausible motions with unseen text inputs, which limits the applications. In this paper, we present OMG, a novel framework, which enables compelling motion generation from zero-shot open-vocabulary text prompts. Our key idea is to carefully tailor the pretrain-then-finetune paradigm into the text-to-motion generation. At the pre-training stage, our model improves the generation ability by learning the rich out-of-domain inherent motion traits. To this end, we scale up a large unconditional diffusion model up to 1B parameters, so as to utilize the massive unlabeled motion data up to over 20M motion instances. At the subsequent fine-tuning stage, we introduce motion ControlNet, which incorporates text prompts as conditioning information, through a trainable copy of the pre-trained model and the proposed novel Mixture-of-Controllers (MoC) block. MoC block adaptively recognizes various ranges of the sub-motions with a cross-attention mechanism and processes them separately with the text-token-specific experts. Such a design effectively aligns the CLIP token embeddings of text prompts to various ranges of compact and expressive motion features. Extensive experiments demonstrate that our OMG achieves significant improvements over the state-of-the-art methods on zero-shot text-to-motion generation. Project page: https://tr3e.github.io/omg-page.
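
The MoC idea described above (cross-attention decides which frames each text token governs, then a token-specific expert processes those frames) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `moc_block`, the projection matrices, the choice of one linear expert per token, and all shapes are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moc_block(motion, tokens, w_q, w_k, w_v, expert_w):
    """Hypothetical Mixture-of-Controllers-style block.

    motion:   (T, D) motion features, one row per frame
    tokens:   (N, D) text token embeddings (e.g. from CLIP)
    w_q, w_k, w_v: (D, D) attention projections (assumed)
    expert_w: (N, D, D) one linear expert per text token (assumed form)
    """
    q = motion @ w_q                      # (T, D) queries from frames
    k = tokens @ w_k                      # (N, D) keys from text tokens
    v = tokens @ w_v                      # (N, D) values from text tokens
    # attn[t, n]: how strongly frame t is assigned to token n.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (T, N)
    # Fuse text information into the motion features.
    fused = motion + attn @ v             # (T, D)
    # Apply every token's expert to every frame: (N, T, D).
    expert_out = np.einsum("td,nde->nte", fused, expert_w)
    # Mix expert outputs per frame using the attention weights.
    mixed = np.einsum("tn,nte->te", attn, expert_out)        # (T, D)
    # Residual connection around the expert mixture.
    return fused + mixed, attn

# Toy inputs: 6 frames, 3 text tokens, feature width 4.
rng = np.random.default_rng(0)
T, N, D = 6, 3, 4
motion = rng.standard_normal((T, D))
tokens = rng.standard_normal((N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) for _ in range(3))
expert_w = rng.standard_normal((N, D, D))

out, attn = moc_block(motion, tokens, w_q, w_k, w_v, expert_w)
```

Here each row of `attn` is a soft assignment of one frame to the text tokens, so a token's "sub-motion range" corresponds to the frames where its column carries most of the weight.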

Paper Structure

This paper contains 14 sections, 7 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Our Open-vocabulary Motion Generation (OMG) approach is capable of generating high-quality motions in response to unseen text prompts.
  • Figure 2: Method overview. We train our OMG model in two stages. First, we leverage large-scale unlabeled motion data to pre-train an unconditional diffusion model with up to 1B parameters (Sec. 3.1). Then, we adopt a conditional fine-tuning scheme called motion ControlNet to condition the pre-trained diffusion model on text prompts (Sec. 3.2). During inference, the pre-trained unconditional denoiser and the fine-tuned conditional denoiser are combined with classifier-free guidance, generating realistic motions with zero-shot text inputs.
  • Figure 3: Motion ControlNet (top) freezes the parameters of the pre-trained transformer layer and combines a trainable copy of the layer with the proposed Mixture-of-Controllers (bottom) block. The MoC block first fuses the text features and motion features and simultaneously determines the sub-motion ranges for each text token with the cross-attention mechanism. Then it performs fine-grained control of sub-motions with text-token-specific experts.
  • Figure 4: Qualitative results generated by our model given various unseen text prompts. Our model effectively captures the motion characteristics from either a single phrase or longer natural sentences.
  • Figure 5: Qualitative comparison. Our method can generate high-quality human motions that better align with text prompts than previous state-of-the-art methods.
  • ...and 7 more figures
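
The Figure 2 caption states that the unconditional and conditional denoisers are combined with classifier-free guidance at inference. A minimal sketch of the standard guidance rule follows; the exact parameterization and guidance scale used by OMG are not given here, so the function name, the `guidance_scale` value, and the toy arrays standing in for denoiser outputs are all assumptions.

```python
import numpy as np

def classifier_free_guidance(pred_uncond, pred_cond, guidance_scale):
    """Standard classifier-free guidance blend of two denoiser outputs.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    and larger values push further toward the text condition.
    """
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy stand-ins for the two denoisers' predictions at one diffusion step
# (8 frames, 4 feature channels); real outputs would come from the
# pre-trained and fine-tuned networks.
rng = np.random.default_rng(0)
pred_uncond = rng.standard_normal((8, 4))
pred_cond = rng.standard_normal((8, 4))

guided = classifier_free_guidance(pred_uncond, pred_cond, guidance_scale=2.5)
```

In a full sampler this blend would be applied at every denoising step before the update to the noisy motion sample.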