Pay Attention and Move Better: Harnessing Attention for Interactive Motion Generation and Training-free Editing

Ling-Hao Chen, Shunlin Lu, Wenxun Dai, Zhiyang Dou, Xuan Ju, Jingbo Wang, Taku Komura, Lei Zhang

TL;DR

MotionCLR is an attention-based motion diffusion model, built from CLR blocks, that explicitly models word-level cross-attention and frame-wise self-attention to enable fine-grained, training-free editing of text-driven human motion. By manipulating attention maps, it supports in-place replacement, (de-)emphasis, erasing, sequence shifting, example-based generation, and style transfer, and its attention visualizations make the model's behavior interpretable. Quantitative and user studies show competitive generation quality and stronger editing capabilities than the baselines, along with promising results on action counting and grounded motion generation. The work provides practical tools (a web interface and a Blender add-on) for integrating interactive motion editing into animation pipelines. It also notes limitations, such as hallucinations and limited robustness to diverse prompts, and outlines future directions for grounding and broader interactions.

Abstract

This research delves into the problem of interactive editing of human motion generation. Previous motion diffusion models lack explicit modeling of the word-level text-motion correspondence and good explainability, which restricts their fine-grained editing ability. To address this issue, we propose an attention-based motion diffusion model, namely MotionCLR, with CLeaR modeling of attention mechanisms. Technically, MotionCLR models the in-modality and cross-modality interactions with self-attention and cross-attention, respectively. More specifically, the self-attention mechanism measures the sequential similarity between frames and impacts the order of motion features. By contrast, the cross-attention mechanism finds the fine-grained word-sequence correspondence and activates the corresponding timesteps in the motion sequence. Based on these key properties, we develop a versatile set of simple yet effective motion editing methods via manipulating attention maps, such as motion (de-)emphasizing, in-place motion replacement, and example-based motion generation. To further verify the explainability of the attention mechanism, we additionally explore the potential of action counting and grounded motion generation via attention maps. Our experimental results show that our method achieves good generation and editing ability with good explainability.
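To make the (de-)emphasis idea concrete, the following is a minimal sketch of editing a word-level cross-attention map. It is not the authors' implementation: the function names (`cross_attention_map`, `reweight_word`, `attend`), the additive edit rule, the clamping at zero, and the choice not to renormalize rows afterwards are all assumptions made for illustration. Each frame of the motion acts as a query, each word of the prompt acts as a key and value, and editing amounts to changing one word's column in the resulting frames-by-words map before it is used to mix word features into the frames.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_map(frame_queries, word_keys):
    """Return a frames-by-words attention map; each row sums to 1."""
    scale = math.sqrt(len(word_keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in word_keys])
        for q in frame_queries
    ]


def reweight_word(attn, word_index, delta):
    """Hypothetical edit rule: add `delta` to one word's column.

    delta > 0 emphasizes the word, delta < 0 de-emphasizes it.
    Weights are clamped at 0; rows are NOT renormalized (an assumption).
    """
    return [
        [max(0.0, w + delta) if j == word_index else w for j, w in enumerate(row)]
        for row in attn
    ]


def attend(attn, word_values):
    """Mix word value vectors into per-frame features using the map."""
    dim = len(word_values[0])
    return [
        [sum(row[j] * word_values[j][d] for j in range(len(word_values)))
         for d in range(dim)]
        for row in attn
    ]


# Toy example: 2 frames, 3 words ("a", "person", "jumps"), 2-dim features.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

attn = cross_attention_map(queries, keys)
emphasized = reweight_word(attn, word_index=2, delta=0.5)  # stronger "jumps"
edited_features = attend(emphasized, values)
```

Because the edit touches only the attention map at inference time, no weights are updated, which is what "training-free" refers to in this setting.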

Paper Structure

This paper contains 56 sections, 3 equations, 32 figures, 11 tables.

Figures (32)

  • Figure 1: We propose MotionCLR, supporting interactive motion generation and versatile editing. The blue and red characters represent original and edited motions. (A) Motion deemphasis and emphasis via adjusting the weight of "jump". (B) In-place replacing the action of "runs" with "jumps". (C) Transferring motion style referring to two motions. From left to right, there are motion style reference, motion texture reference, and transferred motion. (D) Shifting the order of "walking" and "sitting" actions in a motion. (E) Generating diverse motion with the same example motion, a.k.a. example-based motion generation or crowd animation. The left crowd is boxing animation and the right crowd is kicking animation.
  • Figure 2: System overview of the MotionCLR architecture. (a) The U-Net-like denoising network has two CLR blocks before each down/up-sampling. (b) The basic CLR block includes four layers, separating the timestep injection and the text condition. (c) The key component is the text-motion cross-attention at the word level.
  • Figure 3: Empirical study of attention mechanisms. We use "a person jumps." as an example. (A) Keyframes and the root trajectory of the generated motion. The character jumps at $\sim 15-40$f, $\sim 60-80$f, and $\sim 125-145$f, respectively. (B) The cross-attention map between timesteps and words. The "jump" word is highly activated in alignment with the "jump" action. (C) The self-attention map visualization. The character jumps three times, which shows up as nine areas in the self-attention map. Different jumps share similar local motion patterns.
  • Figure 4: Diagram of motion editing via manipulating attention maps.
  • Figure 5: In-place motion replacement. (a) and (b) are a pair of motions before and after editing. (c) is a comparison of original and edited motions.
  • ...and 27 more figures
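
The Figure 3 caption notes that the "jump" word is highly activated during each of the three jumps. One simple way to turn that observation into an action count is to threshold the word's per-frame activation and count the contiguous active runs. The sketch below is a heuristic for illustration only, not the procedure used in the paper; the function name `count_activations`, the threshold, and the minimum run length are all assumed values, and the activation curve is synthetic.

```python
def count_activations(activation, threshold=0.5, min_frames=3):
    """Count contiguous runs where a word's per-frame activation stays high.

    `threshold` and `min_frames` are illustrative defaults, not values
    taken from the paper.
    """
    count = 0
    run = 0
    for value in activation:
        if value >= threshold:
            run += 1
        else:
            if run >= min_frames:
                count += 1
            run = 0
    if run >= min_frames:  # close a run that reaches the last frame
        count += 1
    return count


# Synthetic curve mimicking the frame ranges quoted in the Figure 3 caption.
curve = [
    0.9 if (15 <= t <= 40 or 60 <= t <= 80 or 125 <= t <= 145) else 0.1
    for t in range(150)
]
num_jumps = count_activations(curve)  # 3 for this synthetic curve
```

On real attention maps the activation is noisy, so smoothing the curve or tuning `min_frames` would likely be needed before the count is reliable.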

Theorems & Definitions (2)

  • Remark 1
  • Remark 2