Seamless Human Motion Composition with Blended Positional Encodings

German Barquero; Sergio Escalera; Cristina Palmero

TL;DR

FlowMDM generates long, seamless human motion compositions conditioned on a sequence of varying textual descriptions. It uses a bidirectional diffusion framework with a Transformer denoiser and introduces Blended Positional Encodings (BPE): absolute positional encodings (APE) in the early denoising steps preserve global semantics, and relative positional encodings (RPE) in the later steps enable smooth transitions. Pose-Centric Cross-Attention lets the model handle multiple conditions without transition artifacts. The paper also proposes two jerk-based metrics, Peak Jerk (PJ) and Area Under the Jerk (AUJ), to better capture transition smoothness. Empirically, FlowMDM achieves state-of-the-art results on Babel and HumanML3D, offers robust single-description training with effective extrapolation, and provides practical guidance on scheduling, attention horizons, and guidance weights for high-quality motion compositions.

Abstract

Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. While prior works have focused on generating motion guided by text, music, or scenes, these typically result in isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this, we introduce the Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention, which makes it robust against varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt transitions.
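The two-stage idea behind Blended Positional Encodings can be illustrated with a small scheduling sketch. This is not the authors' implementation: the function name `bpe_schedule`, the `ape_fraction` parameter, and the hard switch between modes are assumptions made for illustration, and the 10% default is taken from the trade-off reported in the Figure 5 caption.

```python
from typing import List


def bpe_schedule(num_steps: int, ape_fraction: float = 0.1) -> List[str]:
    """Return the positional-encoding mode for each denoising step, in sampling order.

    Assumption: the first `ape_fraction` of the steps (the noisiest ones) use
    absolute encodings (APE) and all remaining steps use relative encodings (RPE).
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= ape_fraction <= 1.0:
        raise ValueError("ape_fraction must lie in [0, 1]")
    num_ape = round(num_steps * ape_fraction)
    # Absolute stage first (global coherence), relative stage afterwards (transitions).
    return ["APE"] * num_ape + ["RPE"] * (num_steps - num_ape)


if __name__ == "__main__":
    schedule = bpe_schedule(num_steps=20, ape_fraction=0.1)
    print(schedule)
```

In a sampler, each entry of the returned list would select which positional encoding the denoiser applies at that step. Raising `ape_fraction` moves more steps into the absolute stage.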

Paper Structure (19 sections, 4 equations, 10 figures, 9 tables)

Introduction
Related work
Methodology
Bidirectional diffusion
Blended positional encodings
Pose-centric cross-attention
Experiments
Experimental setup
Quantitative analysis
Qualitative results
Conclusion
Further implementation details
Evaluation details
More experimental results
Fine-grained comparison
...and 4 more sections

Figures (10)

Figure 1: We present FlowMDM, a diffusion-based approach capable of generating seamlessly continuous sequences of human motion from textual descriptions (left). The whole sequence is generated simultaneously and it does not require any postprocessing. FlowMDM also makes strides in the challenging problem of extrapolating and controlling periodic motion such as walking, jumping, or waving (right).
Figure 2: Attention scores of a single query pose (current frame) as a function of the pose attended to (x-axis) in a diffusion-based motion generation model with a sinusoidal absolute positional encoding. Curves show the scores at each denoising step. We observe that, whereas early steps show strong global dependencies (blue), later denoising stages exhibit a clearly local behavior (red).
Figure 3: Pose-centric cross-attention. Our attention minimizes the entanglement between the control signal (e.g., text, objects) and the noisy motion by feeding the former only to the query. Consequently, our model denoises each frame's noisy pose only leveraging its own condition, and the neighboring noisy poses.
Figure 4: Transition smoothness. Average maximum jerk over joints at each frame of the transitions for both motion composition (left) and extrapolation (right) tasks. While other methods show severe smoothness artifacts at the beginning and end of their transition refinement processes, FlowMDM's jerk curve has the shortest peak for composition, and an absence of peaks for extrapolation.
Figure 5: BPE trade-offs. Increasing the number of APE steps undergone during BPE sampling improves the correspondence between motion and textual description (R-prec), but reduces the transition realism and smoothness (FID and AUJ). The best balance is reached around 10% of APE denoising steps.
...and 5 more figures
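The jerk curve described in the Figure 4 caption (maximum jerk over joints at each frame) can be sketched with a third-order finite difference over joint positions. The code below is an illustration under stated assumptions, not the paper's evaluation code. The function names are hypothetical, the reductions in `peak_jerk` (maximum over frames) and `area_under_jerk` (sum of absolute deviations from a `baseline` level) are one plausible reading of the metric names, and the averaging over sequences mentioned in the caption is omitted.

```python
import math
from typing import List, Sequence

# One frame is a sequence of joints; each joint is a sequence of coordinates (x, y, z).
Pose = Sequence[Sequence[float]]


def max_joint_jerk(motion: Sequence[Pose], fps: float) -> List[float]:
    """Per-frame maximum over joints of the jerk magnitude.

    Jerk (third time derivative of position) is approximated with the forward
    difference x[t+3] - 3*x[t+2] + 3*x[t+1] - x[t], divided by dt**3.
    """
    dt = 1.0 / fps
    curve = []
    for t in range(len(motion) - 3):
        per_joint = []
        for j in range(len(motion[t])):
            jerk_vec = [
                (motion[t + 3][j][c] - 3 * motion[t + 2][j][c]
                 + 3 * motion[t + 1][j][c] - motion[t][j][c]) / dt ** 3
                for c in range(len(motion[t][j]))
            ]
            per_joint.append(math.sqrt(sum(v * v for v in jerk_vec)))
        curve.append(max(per_joint))
    return curve


def peak_jerk(curve: Sequence[float]) -> float:
    """Assumed reduction: the largest value of the jerk curve."""
    return max(curve)


def area_under_jerk(curve: Sequence[float], baseline: float = 0.0) -> float:
    """Assumed reduction: summed absolute deviation from a reference jerk level."""
    return sum(abs(v - baseline) for v in curve)


if __name__ == "__main__":
    # A single joint moving along x as t**3 has a constant third difference.
    cubic = [[[float(t ** 3), 0.0, 0.0]] for t in range(6)]
    curve = max_joint_jerk(cubic, fps=1.0)
    print(curve, peak_jerk(curve), area_under_jerk(curve))
```

A motion with constant velocity yields a jerk curve of zeros under this approximation, so any peak around a transition frame points to an abrupt change in acceleration.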
