Textual Decomposition Then Sub-motion-space Scattering for Open-Vocabulary Motion Generation

Ke Fan, Jiangning Zhang, Ran Yi, Jingyu Gong, Yabiao Wang, Yating Wang, Xin Tan, Chengjie Wang, Lizhuang Ma

TL;DR

This paper proposes to use the atomic motion as an intermediate representation and to apply two orderly coupled steps, i.e., Textual Decomposition and Sub-motion-space Scattering, to address the full mapping problem of open-vocabulary motion generation.

Abstract

Text-to-motion generation is a crucial task in computer vision that generates a target 3D motion from a given text. Existing annotated datasets are limited in scale, so most existing methods overfit to these small datasets and cannot generalize to open-domain motions. Some methods attempt to solve the open-vocabulary motion generation problem by aligning to the CLIP space or by using the Pretrain-then-Finetuning paradigm. However, the limited scale of current annotated datasets only allows them to achieve a mapping from sub-text-space to sub-motion-space, rather than a mapping between full-text-space and full-motion-space (full mapping), which is the key to attaining open-vocabulary motion generation. To this end, this paper proposes to use the atomic motion (a simple body-part motion over a short time period) as an intermediate representation, and to apply two orderly coupled steps, i.e., Textual Decomposition and Sub-motion-space Scattering, to address the full mapping problem. For Textual Decomposition, we design a fine-grained description conversion algorithm and combine it with the generalization ability of a large language model to convert any given motion text into atomic texts. Sub-motion-space Scattering learns the compositional process from atomic motions to target motions, so that the learned sub-motion-space is scattered to form the full-motion-space. For a given open-domain motion, it transforms extrapolation into interpolation and thereby significantly improves generalization. Our network, DSO-Net, combines textual decomposition (D) and sub-motion-space scattering (S) to solve open-vocabulary (O) motion generation. Extensive experiments demonstrate that our DSO-Net achieves significant improvements over state-of-the-art methods on open-vocabulary motion generation. Code is available at https://vankouf.github.io/DSONet/.

Paper Structure

This paper contains 18 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Compared with current text-to-motion paradigms (Simple mapping, CLIP-based alignment, and Pretrain-then-Finetuning), our method proposes the textual decomposition to decompose the raw motion text into atomic texts and sub-motion-space scattering to learn the composition process from atomic motions to target motions, which significantly improves the ability of open-vocabulary motion generation.
  • Figure 2: The architecture of our entire framework. The overall pipeline adopts discrete generative modeling. 1) In the Motion Pre-Training stage (left blue part), we use the Residual VQ-VAE (RVQ) model, which designs a base layer and $R$ residual layers to learn layer-wise codebooks. By tokenizing the motion sequence into multi-layer discrete tokens, we learn the large-scale motion priors. 2) In the Motion Fine-tuning stage (right green part), we first leverage the large language model (LLM) and the fine-grained description conversion algorithm we design (used only in the training stage) to perform textual decomposition, which converts the raw text of a motion into atomic texts. Then, for the base layer and the residual layers in RVQ, we separately use generative mask modeling and a neural network with several Transformer layers to learn how to predict discrete motion tokens according to a given text. Furthermore, we design a text-motion alignment (TMA) module and a compositional feature fusion (CFF) module to learn the combinational process from atomic motions to the target motions.
  • Figure 3: Details of the compositional feature fusion (CFF) module, where the atomic text matrix is input into the TMA module for feature extraction, and is fused with the motion feature by cross-attention.
  • Figure 4: Comparison with several state-of-the-art methods on open-vocabulary texts.
  • Figure 5: Qualitative results compared with previous state-of-the-art methods.
  • ...and 1 more figures
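
The Figure 2 caption describes tokenizing motion with a base layer and $R$ residual layers, each with its own codebook. The following is a minimal, dependency-free sketch of that general residual quantization idea, not the authors' implementation: each layer picks the nearest code for whatever the previous layers left unexplained. The function names (`nearest_code`, `rvq_encode`, `rvq_decode`), the toy 2-D codebooks, and the input vector are all hypothetical, chosen only for illustration; in the paper the codebooks are learned and operate on encoded motion features.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` layer by layer; returns one token index per layer."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        tokens.append(k)
        # Pass on only what this layer failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected code from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical toy codebooks: one base layer plus R = 2 residual layers.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],        # base layer (coarse)
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],  # residual layer 1
    [[0.0, 0.0], [0.05, 0.0], [0.0, 0.05]],      # residual layer 2 (fine)
]

feature = [1.2, 0.8]  # stand-in for one encoded motion frame
tokens = rvq_encode(feature, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
print(tokens, reconstruction)
```

In this toy setup the base layer captures the coarse position and each residual layer refines it, which is why the caption can treat the base-layer tokens and the residual-layer tokens as separate prediction problems.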