Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer

Sigal Raab, Inbar Gat, Nathan Sala, Guy Tevet, Rotem Shalev-Arkushin, Ohad Fried, Amit H. Bermano, Daniel Cohen-Or

TL;DR

MoMo is a zero-shot, diffusion-based framework that uses the self-attention of pre-trained motion diffusion models to transfer a leader's motion outline onto a follower while preserving the follower's motifs. At inference time, a mixed-attention block combines the leader's queries with the follower's keys and values, so the outline is transferred without any training. The method supports out-of-distribution synthesis, style transfer, and spatial editing, and it uses diffusion inversion so that both real and generated motions can be edited. The authors analyze the distinct roles of queries and keys in motion representations and validate the approach on the MTB benchmark, reporting competitive or superior FID and R-precision compared with specialized baselines. The work advances motion editing by exploiting the priors encoded in attention mechanisms, enabling flexible inference-time manipulation with practical relevance for animation, robotics, and related fields.

Abstract

Given the remarkable results of motion synthesis with diffusion models, a natural question arises: how can we effectively leverage these models for motion editing? Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models, which enables manipulating the latent feature space; hence, they primarily center on handling the motion space. In this work, we explore the attention mechanism of pre-trained motion diffusion models. We uncover the roles and interactions of attention elements in capturing and representing intricate human motion patterns, and carefully integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer. Editing features associated with selected motions allows us to confront a challenge observed in prior motion diffusion approaches, which use general directives (e.g., text, music) for editing, ultimately failing to convey subtle nuances effectively. Our work is inspired by how a monkey closely imitates what it sees while maintaining its unique motion patterns; hence we call it Monkey See, Monkey Do, and dub it MoMo. Employing our technique enables accomplishing tasks such as synthesizing out-of-distribution motions, style transfer, and spatial editing. Furthermore, diffusion inversion is seldom employed for motions; as a result, editing efforts focus on generated motions, limiting the editability of real ones. MoMo harnesses motion inversion, extending its application to both real and generated motions. Experimental results show the advantage of our approach over the current art. In particular, unlike methods tailored for specific applications through training, our approach is applied at inference time, requiring no training. Our webpage is at https://monkeyseedocg.github.io.

Paper Structure (26 sections, 7 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Motion transfer. The top row displays a leader performing a walking motion. The left column showcases sample frames of four followers, each engaged in a different motion. The central block presents the output motion, where the outline of the leader (e.g., leading leg) is transferred to the followers and integrated with their distinct motifs. Note the alignment of the steps for the leader and output motions. Our motion transfer is conducted by manipulating self-attention latent features in a zero-shot fashion.
  • Figure 2: The MoMo Pipeline. The input to our model is two noisy tensors, $X_T^{\text{ldr}}$ and $X_T^{\text{flw}}$, produced by either inverting real motions or sampling Gaussian noise. The two tensors represent the leader and follower motions and are given along with their associated text prompts. We initialize our output motion, $X_T^{\text{out}}$, using the initial noise from the leader motion and pair it with the text prompt from the follower motion. The three noised motions, $X_t^{\text{ldr}}$, $X_t^{\text{flw}}$, and $X_t^{\text{out}}$, are passed to the frozen denoising network at each timestep $t$, along with their prompts and with $t$. Within the denoising network, $X_t^{\text{out}}$ undergoes mixed-attention by combining the query from the leader motion with the key and value from the follower motion. Meanwhile, $X_t^{\text{ldr}}$ and $X_t^{\text{flw}}$ follow a standard diffusion process.
  • Figure 3: Dominant features in Q vs. K. Each row depicts two copies of the same motion, showcasing the K-Means clustering of its $Q$ and $K$ features in the left and right columns, respectively. Note how the features in $Q$ are dominated by the outline, while those in $K$ are dominated by the motion motifs. In the $Q$ column, periodic steps share clusters, ignoring unique patterns. In the $K$ column, clusters are related to motion motifs; thus, walking, turning while walking, and crouching while walking have distinct clusters. Temporal information is evident in the clusters of $Q$ but not in those of $K$. In the $Q$ column, the beginnings of the first two motions and the end of all three are highlighted by the colors of low and high frame numbers, respectively.
  • Figure 4: Correspondence via attention. Follower frames are color-coded according to consecutive indices (top row). For each leader frame (middle row), the nearest-neighbor follower frame (bottom row) is the one that achieves the highest mixed-attention activation, $Q^{\text{ldr}} \cdot (K^{\text{flw}})^T$. These correspondences are semantically aligned, e.g., leader sub-motions moving "up" and "down" are consistently matched with follower frames moving "up" and "down". Some of the nearest neighbors are highlighted with arrows.
  • Figure 5: Attention map per query. In the left column, we display three copies of the leader; in the right column, we show copies of the follower. The top copies depict the motions as they are, while the ones below highlight attention scores. We define two queries corresponding to different semantic temporal regions in the leader motion. Each query corresponds to a different pose, with varied arm direction or body stretch. Each motion column displays attention maps from a single layer, computed in different ways. In the left column, we present self-attention maps derived from queries and keys from the leader motion, causing each query to concentrate on semantically similar regions within that motion. The frame number related to each query is indicated with an arrow. For example, the query in frame 24 focuses on a pose of "standing low in an A pose", in the leader motion. However, frame number 24 corresponds to an entirely different pose in the follower motion in the right column. In the right column, we apply MoMo, aligning leader queries $Q^{\text{ldr}}$ with follower keys $K^{\text{flw}}$. This way we ensure that each query from the leader motion aligns with semantically similar regions of the follower motion. For instance, in frame 60, the query highlights the region where the character raises their arms. The frames with higher correspondence (red) in the right column also belong to characters raising their arms.
  • ...and 2 more figures
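
The mixed-attention step described in Figure 2 and the frame correspondence described in Figure 4 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes standard single-head scaled dot-product attention over per-frame feature vectors, and it omits the learned projections, the multiple heads, and the choice of layers that a real denoising network would involve. The function names (`attention_weights`, `mixed_attention`, `nearest_follower_frames`) and the toy data are hypothetical and chosen only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Row-wise softmax of Q K^T / sqrt(d): one row per query frame."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in K])
        for q in Q
    ]

def mixed_attention(Q_ldr, K_flw, V_flw):
    """Leader queries attend to follower keys and aggregate follower values."""
    W = attention_weights(Q_ldr, K_flw)
    dim = len(V_flw[0])
    return [
        [sum(w * v[j] for w, v in zip(row, V_flw)) for j in range(dim)]
        for row in W
    ]

def nearest_follower_frames(Q_ldr, K_flw):
    """For each leader frame, the follower frame with the highest activation."""
    W = attention_weights(Q_ldr, K_flw)
    return [max(range(len(row)), key=row.__getitem__) for row in W]

# Toy data (assumed, not from the paper): 3 follower frames, 2 leader frames.
K_flw = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
V_flw = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Q_ldr = [[0.0, 0.0, 10.0], [10.0, 0.0, 0.0]]

out = mixed_attention(Q_ldr, K_flw, V_flw)
matches = nearest_follower_frames(Q_ldr, K_flw)
print(matches)
print(out)
```

In this toy setup, each leader query is built to match exactly one follower key, so every output frame is close to the value of its best-matching follower frame. The output has one row per leader frame, which reflects how the leader determines the temporal outline while the content is drawn from the follower.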