Pay Attention and Move Better: Harnessing Attention for Interactive Motion Generation and Training-free Editing
Ling-Hao Chen, Shunlin Lu, Wenxun Dai, Zhiyang Dou, Xuan Ju, Jingbo Wang, Taku Komura, Lei Zhang
TL;DR
MotionCLR is an attention-based motion diffusion model that explicitly models word-level text-motion correspondence with cross-attention and frame-to-frame similarity with self-attention, enabling fine-grained, training-free editing of text-driven human motion. By manipulating attention maps, it supports in-place replacement, (de-)emphasis, erasing, sequence shifting, example-based generation, and style transfer, while offering interpretability through attention visualizations. Quantitative and user studies show competitive generation quality and superior editing capabilities compared with baselines, along with promising action counting and grounded motion generation. The work provides practical tools (a web interface and a Blender add-on) for integrating interactive motion editing into animation pipelines. It also notes limitations such as hallucinations and limited robustness to diverse prompts, and outlines future directions for grounding and broader interactions.
Abstract
This research studies interactive editing in human motion generation. Previous motion diffusion models lack explicit modeling of word-level text-motion correspondence and offer limited explainability, which restricts their fine-grained editing ability. To address this issue, we propose an attention-based motion diffusion model, MotionCLR, with CLeaR modeling of attention mechanisms. Technically, MotionCLR models in-modality and cross-modality interactions with self-attention and cross-attention, respectively. More specifically, the self-attention mechanism measures the sequential similarity between frames and affects the order of motion features. By contrast, the cross-attention mechanism finds the fine-grained word-sequence correspondence and activates the corresponding timesteps in the motion sequence. Based on these key properties, we develop a versatile set of simple yet effective motion editing methods that manipulate attention maps, such as motion (de-)emphasizing, in-place motion replacement, and example-based motion generation. To further verify the explainability of the attention mechanism, we additionally explore action counting and grounded motion generation via attention maps. Our experimental results show that our method achieves good generation and editing ability with good explainability.
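To make the two attention roles described in the abstract concrete, the following is a minimal NumPy sketch, not the MotionCLR implementation. It computes a frame-to-frame self-attention map and a frame-to-word cross-attention map, then illustrates (de-)emphasis by rescaling one word's column of the cross-attention map. The function names, the `scale` parameter, and the renormalization step are illustrative assumptions; the paper's actual editing rule may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_map(motion):
    # motion: (frames, dim). Returns a (frames, frames) map in which
    # entry (i, j) reflects how similar frame i is to frame j.
    dim = motion.shape[-1]
    return softmax(motion @ motion.T / np.sqrt(dim), axis=-1)

def cross_attention(motion_q, word_k, word_v, word_index=None, scale=1.0):
    # motion_q: (frames, dim); word_k, word_v: (words, dim).
    # Returns the attended features (frames, dim) and the
    # (frames, words) map of word weights per frame.
    dim = motion_q.shape[-1]
    attn = softmax(motion_q @ word_k.T / np.sqrt(dim), axis=-1)
    if word_index is not None:
        # Hypothetical edit: rescale one word's weights, then renormalize
        # each row (assumption: scale > 1 emphasizes, scale < 1
        # de-emphasizes, scale = 0 erases the word's influence).
        attn = attn.copy()
        attn[:, word_index] *= scale
        attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ word_v, attn

# Toy inputs standing in for motion features and word embeddings.
rng = np.random.default_rng(0)
frames, words, dim = 8, 4, 16
motion = rng.normal(size=(frames, dim))
text = rng.normal(size=(words, dim))

frame_similarity = self_attention_map(motion)
_, base = cross_attention(motion, text, text)
_, emphasized = cross_attention(motion, text, text, word_index=2, scale=2.0)
_, erased = cross_attention(motion, text, text, word_index=2, scale=0.0)
```

In this sketch the edit touches only the attention map, so no parameters are retrained, which is the sense in which such editing is training-free.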
