Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber

TL;DR

The paper tackles semantic video motion transfer from a motion reference to a target image by leveraging a frozen image-to-video diffusion model and a novel motion-text embedding. It introduces motion-textual inversion, which encodes motion as inflated motion-text tokens and uses cross-attention inflation to achieve high temporal granularity without requiring spatial alignment. The approach preserves target appearance, generalizes across domains and motion types, and outperforms baselines on Something-Something V2, with a user study supporting the findings. Practical trade-offs include per-motion optimization time on large GPUs and potential artifacts due to model priors and domain gaps.

Abstract

Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context. Project website: https://mkansy.github.io/reenact-anything/

Paper Structure (60 sections, 4 equations, 19 figures, 5 tables)

Figures (19)

  • Figure 1: Observation 1. In image-to-video models, the image input primarily dictates the appearance of the generated videos. For example, I2VGen-XL generates a video of a predominantly white horse from a white horse image, even when the input text specifies the horse's color as "pink."
  • Figure 2: Observation 2. In image-to-video models, text/image embeddings significantly influence the generated motions. Swapping the CLIP image embeddings of a real horse and a toy horse in Stable Video Diffusion (SVD) results in a swap of the motions in the output videos. This suggests that the real horse's embedding encodes a walking motion, while the toy horse's embedding encodes camera motion without object movement.
  • Figure 3: Method overview. The baseline image-to-video diffusion model, Stable Video Diffusion (SVD) in our case, takes the first frame as input in two places: as an image latent concatenated with the noisy video and as an image embedding (some other image-to-video diffusion models may input text embeddings here instead). We propose to replace the image embedding $\mathbf{e}$ (shown in red in the inference block) with a learned motion-text embedding $\mathbf{m}^*$ (green). The motion-text embedding is optimized directly with a regular diffusion model loss on one given motion reference video $\mathbf{x}_0$ while keeping the diffusion model frozen. For best results, the motion-text embedding is inflated prior to optimization to $(F+1) \times N$ tokens, where $F$ is the number of frames and $N$ is a hyperparameter, while keeping the embedding dimension $d$ the same to stay compatible with the pre-trained diffusion model. Note that the diffusion process operates in latent space in practice, and other conditionings and model parameterizations (EDM) are omitted for clarity.
  • Figure 4: High-level visualization of our motion-text embedding and cross-attention inflation. The SVD UNet is composed of several levels of blocks, shown in gray, that share a similar structure. We visualize the sub-blocks of level $i$ and their cross-attention maps in more detail. Our inflated motion-text embedding produces more meaningful cross-attention maps, resulting in improved motion learning. The cross-attention maps were extracted from the example of the woman doing jumping jacks in Fig. 3.
  • Figure 5: Qualitative evaluation. We compare our method to SVD = Stable Video Diffusion (baseline, no motion input), VC = VideoComposer, MC = MotionClone, and MD = MotionDirector for three different motions and target images: full-body reenactment, face reenactment, and camera motion.
  • ...and 14 more figures
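
The caption of Figure 3 describes optimizing an inflated embedding of $(F+1) \times N$ tokens against a frozen diffusion model. The following is a minimal toy sketch of that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the one-line linear `frozen_denoiser` stands in for the SVD UNet, each "frame" is a single scalar, the embedding is initialized to zeros, and tokens are routed as one set of $N$ tokens per frame plus one extra set shared by all frames. The only point it demonstrates is that gradients update the embedding `m` while the model weight `w` is never touched.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_token(tokens):
    # Average a list of d-dimensional tokens into one d-dimensional vector.
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def conditioning(m, F, N):
    # Assumed routing: frame f sees its own N tokens plus one shared set of N tokens.
    shared = mean_token(m[F * N:])
    cond = []
    for f in range(F):
        own = mean_token(m[f * N:(f + 1) * N])
        cond.append([a + b for a, b in zip(own, shared)])
    return cond


def frozen_denoiser(noisy, cond, w, skip=0.5):
    # Hypothetical stand-in for the frozen model: w and skip are never updated.
    return [skip * x + dot(w, c) for x, c in zip(noisy, cond)]


def loss_and_grad(m, video, noisy, w, F, N):
    # Mean squared denoising error and its analytic gradient w.r.t. the embedding only.
    pred = frozen_denoiser(noisy, conditioning(m, F, N), w)
    err = [p - x for p, x in zip(pred, video)]
    loss = sum(e * e for e in err) / F
    grad = [[0.0] * len(w) for _ in m]
    shared_ids = list(range(F * N, (F + 1) * N))
    for f, e in enumerate(err):
        g = [2.0 * e * wk / (F * N) for wk in w]
        for j in list(range(f * N, (f + 1) * N)) + shared_ids:
            for k, gk in enumerate(g):
                grad[j][k] += gk
    return loss, grad


def optimize_motion_embedding(video, tokens_per_frame=2, dim=4,
                              steps=300, lr=0.5, sigma=0.1, seed=0):
    rng = random.Random(seed)
    F, N = len(video), tokens_per_frame
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = dot(w, w) ** 0.5
    w = [x / norm for x in w]                      # frozen, unit-norm weight
    m = [[0.0] * dim for _ in range((F + 1) * N)]  # (F+1)*N tokens of dimension d
    history = []
    for _ in range(steps):
        # Fresh noise every step, as in a diffusion training loop.
        noisy = [x + sigma * rng.gauss(0.0, 1.0) for x in video]
        loss, grad = loss_and_grad(m, video, noisy, w, F, N)
        history.append(loss)
        m = [[mk - lr * gk for mk, gk in zip(tok, g)] for tok, g in zip(m, grad)]
    return m, history


if __name__ == "__main__":
    toy_video = [1.0, -0.5, 2.0, 0.3]  # four scalar "frames" (made-up values)
    m, history = optimize_motion_embedding(toy_video)
    print(len(m), "tokens; first loss", history[0], "last loss", history[-1])
```

In this toy, the loss falls because the per-frame tokens absorb the frame-specific information the frozen model cannot recover from the noisy input alone, which is a loose analogue of the embedding carrying the motion while the model stays fixed.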