Infinite Motion: Extended Motion Generation via Long Text Instructions

Mengtian Li, Chengshuo Zhai, Shengxiang Yao, Zhifeng Xie, Keyu Chen, Yu-Gang Jiang

TL;DR

The paper tackles the challenge of long-duration, text-driven human motion generation by introducing the HumanML3D-Extend benchmark and a two-stage Infinite Motion framework. It combines latent-space diffusion for generating short motion segments from arbitrary-length text with a timestamp stitcher that seamlessly splices these segments into infinite sequences while preserving local timing and coherence. Key contributions include dataset expansion for long motions, a timestamp-based long-text processing mechanism, and a plug-and-play splicing module, all validated with extensive experiments showing improved long-sequence generation and editing capabilities. The work enables practical, flexible, and controllable long-motion synthesis with actionable editing and sequencing workflows, while acknowledging computational cost and data-consistency challenges as avenues for future work.

Abstract

In the realm of motion generation, the creation of long-duration, high-quality motion sequences remains a significant challenge. This paper presents our groundbreaking work on "Infinite Motion", a novel approach that leverages long text for extended motion generation, effectively bridging the gap between short and long-duration motion synthesis. Our core insight is the strategic extension and reassembly of existing high-quality text-motion datasets, which has led to the creation of a novel benchmark dataset to facilitate the training of models for extended motion sequences. A key innovation of our model is its ability to accept arbitrary lengths of text as input, enabling the generation of motion sequences tailored to specific narratives or scenarios. Furthermore, we incorporate a timestamp design for text, which allows precise editing of local segments within the generated sequences, offering unparalleled control and flexibility in motion synthesis. We further demonstrate the versatility and practical utility of "Infinite Motion" through three specific applications: natural language interactive editing, motion sequence editing within long sequences, and splicing of independent motion sequences. Each application highlights the adaptability of our approach and broadens the spectrum of possibilities for research and development in motion generation. Through extensive experiments, we demonstrate the superior performance of our model in generating long-sequence motions compared to existing methods. Project page: https://shuochengzhai.github.io/Infinite-motion.github.io/

Paper Structure (18 sections, 4 equations, 13 figures, 5 tables, 1 algorithm)

Figures (13)

  • Figure 1: Infinite motion: We propose a novel method for generating infinite motions, based on the timestamps featured in our HumanML3D-Extend dataset. This approach not only enables the generation of extremely long motions but also facilitates precise control over actions within specific time intervals.
  • Figure 2: Infinite Motion Pipeline: Our model consists of two stages. In Stage I, a diffusion process occurs in the latent space, simultaneously generating multiple segments of short motion sequences. In Stage II, the timestamp stitcher concatenates these short motion segments to form an infinite sequence of motions.
  • Figure 3: The data distribution of the HumanML3D-Extend dataset. Left: The horizontal axis represents the number of frames in a motion sequence. This chart shows the distribution of frame counts across various motion sequences. Middle: The horizontal axis represents the number of words in each text description. This chart shows the distribution of word counts in text descriptions. Right: The horizontal axis represents the number of actions in each motion sequence. This chart shows the distribution of action counts in each motion sequence.
  • Figure 4: Solving the foot sliding issue: After the initial processing, the foot positions exhibit sliding. Quadratic Bézier curve processing is applied to the foot positions so that they conform to a normal human walking posture.
  • Figure 5: The process of timestamp insertion: The sequence lengths (frame[$i$], frame[$j$]) and corresponding text descriptions (text[$i$], text[$j$]) of two motion sequences are extracted. The sequence length of the preceding motion (frame[$i$]) is inserted as a timestamp at the junction of the two text descriptions. The subsequent timestamp is the sum of the lengths of the two motion sequences (frame[$i + j$]).
  • ...and 8 more figures
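
The timestamp insertion described in the Figure 5 caption can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `insert_timestamps` and the `<N>` token format are assumptions, since the caption states only that the cumulative frame count is placed at the junction of the text descriptions.

```python
from itertools import accumulate


def insert_timestamps(texts, frame_counts):
    """Join text descriptions, placing a cumulative frame count after each one.

    The "<N>" token format is hypothetical; the caption only says that the
    running sequence length is inserted as a timestamp.
    """
    if len(texts) != len(frame_counts):
        raise ValueError("texts and frame_counts must have the same length")

    # Running totals: frame[i], then frame[i] + frame[j], and so on.
    cumulative = list(accumulate(frame_counts))

    parts = [f"{text.strip()} <{stamp}>" for text, stamp in zip(texts, cumulative)]
    return " ".join(parts), cumulative


combined, stamps = insert_timestamps(
    ["a person walks forward", "a person sits down"],
    [120, 80],
)
print(combined)  # a person walks forward <120> a person sits down <200>
print(stamps)    # [120, 200]
```

With two segments of 120 and 80 frames, the first timestamp is the length of the preceding motion and the second is the sum of both lengths, matching the caption's description. The same running-sum logic extends to any number of segments.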