Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

TL;DR

GuidedMotion addresses the challenge of text-to-motion generation by enabling fine-grained, local-action control over global motion. It introduces automatic local action sampling from semantic graphs, energy-based local-action diffusion guidance, and a three-level hierarchical diffusion model with CLIP-based node initialization and graph-attention-guided weighting. The approach yields improved controllability and motion diversity on HumanML3D and KIT, with ablations showing the local-action guidance and hierarchical structure driving performance gains. By shifting from direct global generation to a local-to-global paradigm, the method offers flexible, continuous control over trajectories and postures, with potential benefits for interactive animation and human-robot collaboration.

Abstract

Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local action-guided motion diffusion model, which facilitates global motion generation by utilizing local actions as fine-grained control signals. Specifically, we provide an automated method for reference local action sampling and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis. During the diffusion process for synthesizing global motion, we calculate the local-action gradient to provide conditional guidance. This local-to-global paradigm reduces the complexity associated with direct global motion generation and promotes motion diversity via sampling diverse actions as conditions. Extensive experiments on two human motion datasets, i.e., HumanML3D and KIT, demonstrate the effectiveness of our method. Furthermore, our method provides flexibility in seamlessly combining various local actions and continuous guiding weight adjustment, accommodating diverse user preferences, which may hold potential significance for the community. The project page is available at https://jpthu17.github.io/GuidedMotion-project/.
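
The abstract's idea of using a local-action gradient as conditional guidance can be illustrated with a small sketch. Everything specific below is an assumption made for illustration, not the paper's formulation: the energy is taken to be a weighted squared distance between time-aligned segments of the global motion and the reference local actions, the guiding weights are plain floats (the paper derives them with graph attention networks), and `local_action_energy`, `local_action_gradient`, `guided_step`, and `scale` are hypothetical names.

```python
import numpy as np

def local_action_energy(motion, local_actions, weights):
    """Assumed energy: weighted squared distance between motion segments
    and their reference local actions (lower means better aligned)."""
    energy = 0.0
    for (start, ref), w in zip(local_actions, weights):
        segment = motion[start:start + len(ref)]
        energy += w * float(np.sum((segment - ref) ** 2))
    return energy

def local_action_gradient(motion, local_actions, weights):
    """Analytic gradient of the assumed energy with respect to the motion."""
    grad = np.zeros_like(motion)
    for (start, ref), w in zip(local_actions, weights):
        end = start + len(ref)
        grad[start:end] += 2.0 * w * (motion[start:end] - ref)
    return grad

def guided_step(predicted_mean, local_actions, weights, scale=0.1):
    """Shift a denoiser's predicted mean down the energy gradient.
    `scale` is a hypothetical guidance strength."""
    grad = local_action_gradient(predicted_mean, local_actions, weights)
    return predicted_mean - scale * grad

# Toy example: 8 frames with 3 pose features each, all zeros.
mean = np.zeros((8, 3))
# One reference local action covering frames 2..4, all ones.
actions = [(2, np.ones((3, 3)))]
guiding_weights = [1.0]

guided = guided_step(mean, actions, guiding_weights, scale=0.1)
energy_before = local_action_energy(mean, actions, guiding_weights)
energy_after = local_action_energy(guided, actions, guiding_weights)
```

In this toy setting the guided frames move toward the reference action while frames outside the segment are untouched. Raising a guiding weight or `scale` pulls the result harder toward that local action, which is one plausible reading of the continuous weight adjustment the abstract mentions.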

Paper Structure (21 sections, 20 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Generating motion with diverse local actions. Different local actions correspond to distinct user preferences. Our method empowers users to combine preferred local actions freely, generating motions that align with their mental imagery. Furthermore, the combination of varied local actions enhances the motion diversity.
  • Figure 2: The overall framework of GuidedMotion for controllable text-to-motion generation. We propose to employ reference local actions as control signals in the global motion generation process. To automatically obtain these local actions, we deconstruct the original motion description into multiple local action descriptions and utilize a text-to-motion model to generate these local actions.
  • Figure 3: The model architecture of the hierarchical motion diffusion model. Utilizing the semantic graph as input, the hierarchical diffusion model dissects the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. To enhance generation stability, we exclusively implement local action guidance at the action level.
  • Figure 4: Qualitative comparisons. Darker colors indicate later points in time. The motions generated by our method closely align with the descriptions, outperforming other methods, which exhibit degraded motions or improper semantics.
  • Figure 5: The distribution of the number of local actions in each motion. Motions typically consist of multiple local actions rather than just one local action.
  • ...and 3 more figures