InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation

Wenjie Zhuo, Fan Ma, Hehe Fan

TL;DR

InfiniDreamer addresses arbitrarily long human motion generation by decoupling long-sequence synthesis from short-motion priors and refining with Segment Score Distillation (SSD). It initializes a long sequence, samples overlapping short segments with a sliding window, and optimizes each segment to align with a pre-trained motion diffusion prior, while ensuring smooth, coherent transitions via geometric losses. The approach yields improved coherence and contextual alignment on HumanML3D and BABEL without requiring long-sequence data, demonstrating robust long-horizon generation and offering insights into global consistency through overlapping optimization. This work enables practical, context-aware long-duration motion synthesis with potential impact on AR/VR, film, and animation pipelines, while indicating future gains from advances in short-motion diffusion priors.

Abstract

We present InfiniDreamer, a novel framework for arbitrarily long human motion generation. InfiniDreamer addresses the limitations of current motion generation methods, which are typically restricted to short sequences due to the lack of long motion training data. To achieve this, we first generate sub-motions corresponding to each textual description and then assemble them into a coarse, extended sequence using randomly initialized transition segments. We then introduce an optimization-based method called Segment Score Distillation (SSD) to refine the entire long motion sequence. SSD is designed to utilize an existing motion prior, which is trained only on short clips, in a training-free manner. Specifically, SSD iteratively refines overlapping short segments sampled from the coarsely extended long motion sequence, progressively aligning them with the pre-trained motion diffusion prior. This process ensures local coherence within each segment, while the refined transitions between segments maintain global consistency across the entire sequence. Extensive qualitative and quantitative experiments validate the superiority of our framework, showcasing its ability to generate coherent, contextually aware motion sequences of arbitrary length.
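The first two steps described above (assembling a coarse long sequence with randomly initialized transitions, then sampling overlapping short segments from it) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `assemble_coarse_sequence` and `sliding_window_starts`, the Gaussian initialization of transition frames, the toy 2-dimensional frames, and the window and stride values are all hypothetical choices made for the example.

```python
import random


def assemble_coarse_sequence(sub_motions, transition_len, feat_dim, rng):
    """Concatenate sub-motions, inserting random transition frames between them.

    Gaussian noise for the transition frames is an assumption of this sketch.
    """
    sequence = []
    for idx, clip in enumerate(sub_motions):
        if idx > 0:
            # Randomly initialized transition segment between consecutive clips.
            for _ in range(transition_len):
                sequence.append([rng.gauss(0.0, 1.0) for _ in range(feat_dim)])
        sequence.extend(list(frame) for frame in clip)
    return sequence


def sliding_window_starts(total_len, window, stride):
    """Start indices of overlapping windows that together cover the sequence."""
    if window > total_len:
        raise ValueError("window is longer than the sequence")
    if not 0 < stride < window:
        raise ValueError("stride must lie in (0, window) so windows overlap")
    starts = list(range(0, total_len - window + 1, stride))
    # Add a final window flush with the end if the regular grid left a tail.
    if starts[-1] + window < total_len:
        starts.append(total_len - window)
    return starts


# Toy data: two 6-frame "sub-motions" with 2 features per frame (hypothetical).
rng = random.Random(0)
clip_a = [[0.0, 0.0] for _ in range(6)]
clip_b = [[1.0, 1.0] for _ in range(6)]

coarse = assemble_coarse_sequence([clip_a, clip_b], transition_len=3, feat_dim=2, rng=rng)
starts = sliding_window_starts(len(coarse), window=6, stride=3)
segments = [coarse[s:s + 6] for s in starts]
```

Because the stride is smaller than the window, every transition frame falls inside at least two segments, which is what lets refinement of individual segments propagate across the whole sequence.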

Paper Structure

This paper contains 30 sections, 13 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Long motion sequences generated by InfiniDreamer. Given a list of textual prompts, our framework generates realistic human motions corresponding to each prompt, along with smooth and coherent transitions between them. This approach ultimately synthesizes a continuous, fluid long-duration human motion without requiring any long sequence data.
  • Figure 2: Overview of InfiniDreamer for arbitrarily long human motion generation. Given a list of text prompts, our framework generates a coherent and continuous long-sequence motion that aligns closely with each prompt. To achieve this, we start by initializing a long motion sequence using the (1) Motion Sequence Initialization module. Next, the (2) Motion Segment Sampling module iteratively samples short, overlapping sequence segments from the initialized motion. Finally, we refine each sampled segment with our proposed (3) Segment Score Distillation, optimizing each segment to align with the prior distribution of the pre-trained motion diffusion model. Through this iterative process, the framework synthesizes a seamless and fluid long-duration motion sequence, with realistic motions matching each prompt and smooth transitions connecting them.
  • Figure 3: Qualitative Comparisons to Baseline for Long Motion Generation. We present two examples: in the top row, our framework demonstrates strong segment transition capabilities, effectively generating a smooth jogging transition between two jogging motions. In contrast, the baseline produces a transitional segment with noticeable pauses. In the bottom row, we test a more complex and fine-grained example. The baseline method generates drifting motions, misses the segment "dodges something to their left", and introduces mismatched motion such as "crisscrossing". In comparison, our method produces a higher-quality sequence with enhanced fine-grained comprehension.
  • Figure 4: Ablation Study on Learning Rate $\eta$. We experiment with different values of $\eta$ and find that an excessively high learning rate leads to motion stillness (i.e., the motion is lost), while a lower learning rate results in large noise disturbances, causing motion distortions.
  • Figure 5: Qualitative Comparisons to FlowMDM for Long Motion Generation. We present two examples: in the top row, our framework demonstrates strong contextual understanding, guiding the transition segment to "go upstairs" in response to the following "downstairs" prompt. In contrast, FlowMDM shows slight motion drift in this segment. In the bottom row, we use a more fine-grained textual prompt, where FlowMDM exhibits issues with motion drift and semantic errors, failing to generate the "side steps" segment. Our framework, however, produces a higher-quality sequence with enhanced fine-grained comprehension of the text.
  • ...and 1 more figures
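
The segment-wise refinement loop outlined in the Figure 2 caption can be illustrated with a toy example. The sketch below is only in the spirit of Segment Score Distillation and makes strong simplifying assumptions: each frame is a single scalar, `toy_denoiser` is a hypothetical stand-in (a 3-frame moving average) for the pre-trained motion diffusion prior, the update simply pulls each segment toward the stand-in's clean prediction of a noised copy, and the paper's actual loss, timestep weighting, and geometric losses are not reproduced. The parameter `lr` plays the role of the learning rate $\eta$ from Figure 4; all names and values are illustrative.

```python
import random


def toy_denoiser(noisy_segment):
    """Hypothetical stand-in for a short-motion prior: 3-frame moving average."""
    n = len(noisy_segment)
    out = []
    for i in range(n):
        neighbours = noisy_segment[max(0, i - 1):min(n, i + 2)]
        out.append(sum(neighbours) / len(neighbours))
    return out


def refine_long_sequence(sequence, window, stride, lr, noise_scale, rounds, rng):
    """Repeatedly refine overlapping segments of a long 1-D sequence in place."""
    seq = list(sequence)
    last_start = len(seq) - window
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)
    for _ in range(rounds):
        for s in starts:
            segment = seq[s:s + window]
            # Perturb the segment, then ask the stand-in prior for a clean estimate.
            noisy = [x + noise_scale * rng.gauss(0.0, 1.0) for x in segment]
            predicted = toy_denoiser(noisy)
            # Gradient-style step pulling the segment toward the prior's estimate.
            for i in range(window):
                seq[s + i] -= lr * (segment[i] - predicted[i])
    return seq


def roughness(sequence):
    """Sum of squared frame-to-frame differences (lower means smoother)."""
    return sum((b - a) ** 2 for a, b in zip(sequence, sequence[1:]))


# Toy "coarse" sequence: an abrupt jump between two constant sub-motions.
coarse = [0.0] * 20 + [1.0] * 20
refined = refine_long_sequence(
    coarse, window=10, stride=5, lr=0.3, noise_scale=0.01,
    rounds=50, rng=random.Random(0),
)
```

Since neighbouring windows share frames, each update to one segment changes the input of the next, so the abrupt jump is gradually spread over many frames even though no single update sees the whole sequence.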