InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation
Wenjie Zhuo, Fan Ma, Hehe Fan
TL;DR
InfiniDreamer generates arbitrarily long human motion without long-sequence training data. It builds a coarse long sequence and refines it with Segment Score Distillation (SSD), which relies only on a motion diffusion prior pre-trained on short clips. SSD samples overlapping short segments from the long sequence with a sliding window and optimizes each segment to align with the prior, while geometric losses keep transitions smooth and coherent. Because the segments overlap, the locally refined transitions add up to global consistency across the whole sequence. On HumanML3D and BABEL, the approach improves coherence and contextual alignment. This makes context-aware long-duration motion synthesis practical for AR/VR, film, and animation pipelines, and the method should benefit directly from future advances in short-motion diffusion priors.
Abstract
We present InfiniDreamer, a novel framework for arbitrarily long human motion generation. InfiniDreamer addresses the limitations of current motion generation methods, which are typically restricted to short sequences due to the lack of long motion training data. To achieve this, we first generate sub-motions corresponding to each textual description and then assemble them into a coarse, extended sequence using randomly initialized transition segments. We then introduce an optimization-based method called Segment Score Distillation (SSD) to refine the entire long motion sequence. SSD is designed to utilize an existing motion prior, which is trained only on short clips, in a training-free manner. Specifically, SSD iteratively refines overlapping short segments sampled from the coarsely extended long motion sequence, progressively aligning them with the pre-trained motion diffusion prior. This process ensures local coherence within each segment, while the refined transitions between segments maintain global consistency across the entire sequence. Extensive qualitative and quantitative experiments validate the superiority of our framework, showcasing its ability to generate coherent, contextually aware motion sequences of arbitrary length.
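The assemble-then-refine loop described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: `toy_prior_estimate` is a hypothetical stand-in (a 3-frame moving average) for the pre-trained motion diffusion prior, the update rule is a simple pull toward that estimate rather than the actual SSD objective, and the geometric losses are omitted. The function names, window size, stride, step count, and learning rate are all assumptions chosen for illustration.

```python
import random


def assemble_coarse_sequence(sub_motions, transition_len, rng):
    """Concatenate sub-motions, inserting random transition frames between them."""
    dim = len(sub_motions[0][0])
    sequence = []
    for index, motion in enumerate(sub_motions):
        if index > 0:
            # Randomly initialized transition segment between two sub-motions.
            sequence.extend(
                [rng.gauss(0.0, 1.0) for _ in range(dim)]
                for _ in range(transition_len)
            )
        sequence.extend(list(frame) for frame in motion)
    return sequence


def sliding_windows(total_len, window, stride):
    """Start indices of overlapping windows that cover the whole sequence."""
    if total_len <= window:
        return [0]
    starts = list(range(0, total_len - window + 1, stride))
    if starts[-1] != total_len - window:
        starts.append(total_len - window)  # make sure the tail is covered
    return starts


def toy_prior_estimate(segment):
    """HYPOTHETICAL stand-in for a short-motion diffusion prior.

    Returns a 'clean' estimate of the segment using a 3-frame moving
    average with clamped edges. A real prior would be a trained denoiser.
    """
    last = len(segment) - 1
    dim = len(segment[0])
    return [
        [
            (segment[max(t - 1, 0)][d] + segment[t][d] + segment[min(t + 1, last)][d]) / 3.0
            for d in range(dim)
        ]
        for t in range(len(segment))
    ]


def refine(sequence, window=16, stride=8, steps=30, lr=0.5, noise_scale=0.05, rng=None):
    """Iteratively pull overlapping segments toward the (toy) prior's estimate."""
    rng = rng or random.Random(0)
    seq = [list(frame) for frame in sequence]  # work on a copy
    for _ in range(steps):
        for start in sliding_windows(len(seq), window, stride):
            segment = seq[start:start + window]  # shares frame lists with seq
            # Perturb the segment, then ask the prior for a clean estimate.
            noisy = [[x + noise_scale * rng.gauss(0.0, 1.0) for x in frame] for frame in segment]
            target = toy_prior_estimate(noisy)
            for t, frame in enumerate(segment):
                for d in range(len(frame)):
                    frame[d] += lr * (target[t][d] - frame[d])  # updates seq in place
    return seq


def roughness(sequence):
    """Sum of squared frame-to-frame differences (lower means smoother)."""
    return sum(
        (b - a) ** 2
        for prev, cur in zip(sequence, sequence[1:])
        for a, b in zip(prev, cur)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    walk = [[0.1 * t, 0.0] for t in range(20)]  # toy 2-D "sub-motion"
    wave = [[2.0, 0.1 * t] for t in range(20)]
    coarse = assemble_coarse_sequence([walk, wave], transition_len=8, rng=rng)
    refined = refine(coarse)
    print(roughness(coarse), roughness(refined))
```

Because the stride is smaller than the window, every transition frame falls inside more than one segment, so updates made in one window are seen by its neighbours on the next pass. That overlap is what lets purely local refinement smooth the sequence as a whole in this toy setting.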
