OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models

Mathis Koroglu, Hugo Caselles-Dupré, Guillaume Jeanneret Sanmiguel, Matthieu Cord

TL;DR

OnlyFlow addresses the challenge of motion-controllable text-to-video generation by conditioning a diffusion-based video model on optical flow extracted from an input video. A trainable optical flow encoder $\Phi$ processes the flow and injects multi-scale features into the temporal attention blocks of a frozen AnimateDiff backbone, with a controllable strength parameter $\gamma$ guiding motion influence. Across quantitative metrics (FVD, flow fidelity, CLIP alignment) and user studies, OnlyFlow demonstrates effective motion transfer and prompt fidelity, offering a lightweight, flexible approach that extends to video-to-video (V2V) editing and camera-like movements without task-specific retraining. Limitations include photorealism and resolution constraints inherent to the base model. Future work could explore alternative motion signals to better separate camera and object motion while expanding motion-conditioned generation capabilities.

Abstract

We consider the problem of text-to-video generation with precise control for various applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on user-defined controls, such as binary masks or camera movement embeddings. We propose OnlyFlow, an approach that leverages the optical flow extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented by applying an optical flow estimation model to the input video and feeding the estimated flow to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow compares favorably to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.

Paper Structure

This paper contains 36 sections, 2 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: OnlyFlow controls the generation of video with text and motion of a video input, synthetically generated or not. We strongly encourage readers to check our supplemental content for video results that are not well represented by still images.
  • Figure 2: Overview of OnlyFlow. Inputs are i) a tokenized and encoded text prompt, ii) noisy latents for the diffusion model and iii) the optical flow of an input video. The latter is fed through a trainable optical flow encoder, which outputs feature maps that are injected into the diffusion U-net. We experiment with several injection strategies; for illustration purposes, we only show the injection into the temporal attention layers of the U-net. The U-net is kept frozen during training. The generated output video matches the input prompt and motion.
  • Figure 3: Injection strategy of the encoded optical flow conditioning $c_k$ from the optical flow encoder into the temporal attention layers of the $k$-th block of the U-net.
  • Figure 4: Metrics computed on OnlyFlow-generated videos for both feature map injection strategies, OnlyFlow-T (orange) and OnlyFlow-ST (blue), for different values of $\gamma$. Motion realism, shown in (a), improves with $\gamma$, as does optical flow similarity in (b). Panel (c) shows that the CLIP score does not deteriorate, so the videos respect the prompt just as well.
  • Figure 5: Qualitative comparison of video-to-video generation models. Videos are generated using the same text prompt and input video. OnlyFlow exhibits a superior combination of motion fidelity and image realism. It compares favorably to approaches that use depth maps (RAVE, VideoComposer, Control-A-Video), and is comparable to Gen-1 in temporal coherence and to VideoComposer in image quality.
  • ...and 4 more figures
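
The captions of Figures 2 and 3 describe injecting the encoded optical flow $c_k$ into the temporal attention layers of the $k$-th U-net block, with $\gamma$ controlling its strength. The toy sketch below illustrates one way such an injection could look. It is not the authors' implementation: the additive form `hidden + gamma * c_k`, the function names, and the projection-free attention are all assumptions made for illustration, since the summary above does not state the exact formula.

```python
import math


def temporal_self_attention(frames):
    """Softmax self-attention across the frame axis.

    frames: list of F feature vectors (one per frame), each a list of D floats.
    Assumption: no learned query/key/value projections, to keep the sketch small.
    """
    dim = len(frames[0])
    output = []
    for query in frames:
        # Scaled dot-product score between this frame and every frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        # Numerically stable softmax over frames.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the frame vectors.
        output.append(
            [sum(w * value[j] for w, value in zip(weights, frames))
             for j in range(dim)]
        )
    return output


def inject_flow(hidden, flow_features, gamma):
    """Hypothetical additive injection: hidden + gamma * c_k, element-wise."""
    return [
        [h + gamma * c for h, c in zip(h_row, c_row)]
        for h_row, c_row in zip(hidden, flow_features)
    ]


def conditioned_temporal_block(hidden, flow_features, gamma=1.0):
    """Add the scaled flow features, then attend over frames."""
    return temporal_self_attention(inject_flow(hidden, flow_features, gamma))


# Two frames with two features each (made-up numbers).
hidden = [[1.0, 0.0], [0.0, 1.0]]
c_k = [[0.5, -0.5], [0.25, 0.25]]
unconditioned = temporal_self_attention(hidden)
conditioned = conditioned_temporal_block(hidden, c_k, gamma=1.0)
```

Under this assumed form, setting `gamma=0` recovers the unconditioned temporal attention output, which matches the role of $\gamma$ as a strength control; larger values let the flow features weigh more heavily on what the temporal layers attend to.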