LGTM: Local-to-Global Text-Driven Human Motion Diffusion Model

Haowen Sun, Ruikun Zheng, Haibin Huang, Chongyang Ma, Hui Huang, Ruizhen Hu

TL;DR

LGTM tackles the persistent challenge of translating text into realistic, globally coherent 3D human motion by introducing a local-to-global diffusion framework. It employs a Partition Module that uses LLMs to decompose global motion descriptions into six part-level narratives, each processed by independent Part Motion Encoders, thereby reducing local semantic leakage. A subsequent Full-Body Motion Optimizer, based on attention, fuses these part-level latents with full-body semantics to enforce cross-part coordination and temporal consistency. Extensive qualitative and quantitative evaluations show LGTM achieves superior local semantic accuracy and global coherence compared with state-of-the-art baselines, highlighting the value of integrating LLM-driven partitioning with part-wise encoding and a global optimization stage. Limitations include reliance on ChatGPT for decomposition and potential ambiguity in prompts, with proposed future work aiming to incorporate VQ-VAE–style tokenization for finer motion tokens.

Abstract

In this paper, we introduce LGTM, a novel Local-to-Global pipeline for Text-to-Motion generation. LGTM uses a diffusion-based architecture to address the challenge of accurately translating textual descriptions into semantically coherent human motion in computer animation. Traditional methods often struggle with semantic discrepancies, particularly in aligning specific motions to the correct body parts. To address this issue, we propose a two-stage pipeline: it first employs large language models (LLMs) to decompose global motion descriptions into part-specific narratives, which are then processed by independent body-part motion encoders to ensure precise local semantic alignment. An attention-based full-body optimizer then refines the generated motion and ensures overall coherence. Our experiments demonstrate that LGTM achieves significant improvements in generating locally accurate, semantically aligned human motion, marking a notable advancement in text-to-motion applications. Code and data for this paper are available at https://github.com/L-Sun/LGTM
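
The two-stage flow described above can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: the six part names, the prompt wording, and the functions `partition_text`, `encode_parts`, `optimize_full_body`, and `run_toy_pipeline` are all hypothetical, and the LLM, encoders, and optimizer are passed in as plain callables so that the data flow is visible without any model code.

```python
from typing import Callable, Dict, List

Vector = List[float]

# Assumption: the source states only that there are six parts; these names are illustrative.
BODY_PARTS = ["head", "left_arm", "right_arm", "torso", "left_leg", "right_leg"]


def partition_text(description: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Stage 1a: ask an LLM for one narrative per body part (hypothetical prompt)."""
    return {
        part: llm(f"Describe only the {part} motion in: {description}")
        for part in BODY_PARTS
    }


def encode_parts(
    part_texts: Dict[str, str],
    encoders: Dict[str, Callable[[str], Vector]],
) -> Dict[str, Vector]:
    """Stage 1b: each part text is seen by its own encoder only."""
    return {part: encoders[part](text) for part, text in part_texts.items()}


def optimize_full_body(
    part_latents: Dict[str, Vector],
    global_latent: Vector,
    optimizer: Callable[[Vector, Vector], Vector],
) -> Vector:
    """Stage 2: fuse part latents in a fixed order, then refine with global semantics."""
    fused = [x for part in BODY_PARTS for x in part_latents[part]]
    return optimizer(fused, global_latent)


def run_toy_pipeline(description: str) -> Vector:
    """Wire the stages together with trivial stand-ins for the learned components."""
    toy_llm = lambda prompt: prompt                       # echoes the prompt
    toy_encoders = {p: (lambda text: [float(len(text))]) for p in BODY_PARTS}
    toy_optimizer = lambda fused, g: [x + g[0] for x in fused]

    part_texts = partition_text(description, toy_llm)
    part_latents = encode_parts(part_texts, toy_encoders)
    global_latent = [float(len(description))]
    return optimize_full_body(part_latents, global_latent, toy_optimizer)
```

Keeping one encoder per part means the text for one part never reaches another part's encoder, which is the property the abstract attributes to the independent encoders; cross-part coordination is left entirely to the final optimizer.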

Paper Structure

This paper contains 23 sections, 4 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Overview of our LGTM framework, which consists of three major components. The partition module uses ChatGPT to decompose the motion description $T$ into body-part-level texts $T_\mathrm{part}$, and decomposes the full-body motion $\mathbf{M}$ into body-part motions $\mathbf{M}_\mathrm{part}$. The part motion encoders independently encode each part-level motion together with its corresponding part-level text and a diffusion time step $n$. The full-body motion optimizer uses an attention module to optimize the fused body-part motion with full-body text semantics.
  • Figure 2: The structure of an attention encoder block.
  • Figure 3: Example results generated by our method.
  • Figure 4: Qualitative comparison of results generated by our method with those from MDM (Tevet et al., 2022) and MLD (Chen et al., 2023).
  • Figure 5: Motions generated by our method with and without the full-body optimizer for "a person walks upstairs, turns left, and walks back downstairs."
  • ...and 1 more figures
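
The motion-side partition mentioned in the Figure 1 caption, splitting $\mathbf{M}$ into $\mathbf{M}_\mathrm{part}$, can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the toy 8-joint skeleton, the joint-to-part grouping in `PART_JOINTS`, the one-value-per-joint frame layout, and the helper names `split_motion` and `merge_motion` do not come from the paper.

```python
from typing import Dict, List

Frame = List[float]   # one value per joint, for illustration only
Motion = List[Frame]  # a sequence of frames

# Hypothetical grouping for a toy 8-joint skeleton (not the paper's skeleton).
PART_JOINTS: Dict[str, List[int]] = {
    "head": [0],
    "torso": [1],
    "left_arm": [2, 3],
    "right_arm": [4, 5],
    "left_leg": [6],
    "right_leg": [7],
}


def split_motion(motion: Motion, part_joints: Dict[str, List[int]]) -> Dict[str, Motion]:
    """Slice every frame by joint index so each part gets its own motion sequence."""
    return {
        part: [[frame[j] for j in joints] for frame in motion]
        for part, joints in part_joints.items()
    }


def merge_motion(
    parts: Dict[str, Motion],
    part_joints: Dict[str, List[int]],
    num_joints: int,
) -> Motion:
    """Inverse of split_motion: write each part's values back to its joint slots."""
    num_frames = len(next(iter(parts.values())))
    merged = [[0.0] * num_joints for _ in range(num_frames)]
    for part, joints in part_joints.items():
        for t, frame in enumerate(parts[part]):
            for k, j in enumerate(joints):
                merged[t][j] = frame[k]
    return merged
```

Because the joint groups are disjoint and cover every joint, splitting and then merging returns the original sequence, so the part-level view loses no information before the encoders see it.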