DisMo: Disentangled Motion Representations for Open-World Motion Transfer

Thomas Ressler-Antal, Frank Fundel, Malek Ben Alaya, Stefan Andreas Baumann, Felix Krause, Ming Gui, Björn Ommer

Abstract

Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories. Unlike prior methods, which trade off motion fidelity against prompt adherence by either overfitting to source structure or drifting from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester. Project page: https://compvis.github.io/DisMo

Paper Structure

This paper contains 37 sections, 6 equations, 18 figures, 12 tables.

Figures (18)

  • Figure 1: Motion transfer examples enabled by our abstract motion representations. We extract abstract motion representations from driving videos and transfer them onto new content, represented either by source images (Left, Middle) or by text prompts (Right).
  • Figure 2: Method Overview. (a) During training, our motion extractor $\mathcal{M}_\theta$ receives augmented frames from a video $\mathbf{X}$, along with additional motion query tokens $\mathbf{Q}$. These are then individually passed to the frame generator $\mathcal{F}_\psi$, alongside the corresponding source frame $\mathbf{x}_t$, from which it learns to reconstruct a frame at a future timestep ${t+{\Delta_t}}$. (b) To transfer a motion sequence $\mathbf{M}$ onto another target image, we can directly utilize the trained frame generator $\mathcal{F}_{\psi}$ autoregressively as a low-cost option. (c) For high-quality motion transfer, we adapt pre-trained off-the-shelf video generation models. A motion sequence is first embedded using a mapping network before being introduced to the frozen video model. The processed motion sequence is arranged such that each token at timestep $t$ in the pre-trained backbone receives conditioning only from the temporally corresponding motion embedding $\mathbf{m}_t$.
  • Figure 3: Qualitative motion transfer comparison between (a) our auto-regressive frame generator and (b) an adapted video model. We transfer motion extracted from a driving video onto a new target image using each model. While both approaches manage to transfer high-level motion semantics in a view- and appearance-invariant manner, the adapted video model achieves higher generation fidelity.
  • Figure 4: Motion transfer examples in different settings. We compare different motion transfer methods on three settings: (Left) Inter-category motion transfer. (Middle) An example showcasing camera motion. (Right) Composed motion transfer.
  • Figure A: UMAP visualization of the IARD dataset. Compared to V-JEPA, our model shows better grouping by action and almost no grouping by identity.
  • ...and 13 more figures
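
The temporal alignment described in part (c) of the Figure 2 caption, where each backbone token at timestep $t$ is conditioned only on $\mathbf{m}_t$, can be pictured as a block-structured conditioning mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes video tokens are ordered frame by frame, that there is exactly one motion embedding per frame, and that the conditioning is applied through a masked attention-style lookup. The function name `temporal_alignment_mask` and its parameters are hypothetical.

```python
def temporal_alignment_mask(num_frames, tokens_per_frame):
    """Build a boolean mask with one row per video token and one column
    per motion embedding.

    mask[i][t] is True only when video token i belongs to frame t, so a
    token can read the motion embedding of its own timestep and no other.
    Assumption (not from the paper): tokens are laid out frame-major,
    i.e. all tokens of frame 0 first, then frame 1, and so on.
    """
    mask = []
    for frame in range(num_frames):
        # One allowed column per row: the motion embedding of this frame.
        row = [t == frame for t in range(num_frames)]
        for _ in range(tokens_per_frame):
            mask.append(list(row))
    return mask


# Hypothetical toy sizes: 3 frames, 2 tokens per frame.
mask = temporal_alignment_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Every row of the mask allows exactly one motion embedding, which is what distinguishes this per-timestep conditioning from a global conditioning signal that all tokens can see. How the adapted video model actually injects $\mathbf{m}_t$ is not stated in the caption, so the attention-mask view here is only one possible reading.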