Optical-Flow Guided Prompt Optimization for Coherent Video Generation

Hyelin Nam, Jaemin Kim, Dohun Lee, Jong Chul Ye

TL;DR

MotionPrompt tackles temporal inconsistency in text-to-video diffusion by optimizing the prompt on the fly, guided by an optical-flow discriminator. The method appends learnable tokens to the prompt and updates them during sampling, using gradients derived from a subset of frames to steer motion toward realistic dynamics without retraining the diffusion model. A composite loss combining discriminator-based flow realism, flow smoothness, and token-embedding regularization drives the optimization. Across LaVie, AnimateDiff, VideoCrafter2, and an image-to-video extension, MotionPrompt improves temporal coherence and motion realism with modest computational overhead, validated by qualitative, quantitative, and user-study results.

Abstract

While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces the additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish the optical flow between random pairs of frames in real videos from that in generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from the trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.
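
The composite objective mentioned in the TL;DR (flow realism, flow smoothness, token-embedding regularization) and the random frame-pair sampling described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's formulation: the exact form of each term, the weights `w_disc`, `w_smooth`, `w_reg`, and all function names are assumptions, and the discriminator is represented only by a scalar logit rather than a trained network.

```python
import math
import random


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def realism_loss(disc_logit):
    # Assumed non-saturating GAN-style term: small when the (hypothetical)
    # discriminator logit says the flow looks "real" (large positive value).
    return softplus(-disc_logit)


def smoothness_loss(flow):
    # flow: grid (list of rows) of (u, v) displacement tuples.
    # Assumed form: mean squared difference between adjacent flow vectors.
    total, count = 0.0, 0
    height, width = len(flow), len(flow[0])
    for y in range(height):
        for x in range(width):
            u, v = flow[y][x]
            if x + 1 < width:  # right neighbour
                u2, v2 = flow[y][x + 1]
                total += (u2 - u) ** 2 + (v2 - v) ** 2
                count += 1
            if y + 1 < height:  # bottom neighbour
                u2, v2 = flow[y + 1][x]
                total += (u2 - u) ** 2 + (v2 - v) ** 2
                count += 1
    return total / count if count else 0.0


def embedding_reg(tokens, tokens_init):
    # Assumed form: squared L2 distance between the learnable token
    # embedding and its initial value (both flat lists of floats).
    return sum((a - b) ** 2 for a, b in zip(tokens, tokens_init))


def composite_loss(disc_logit, flow, tokens, tokens_init,
                   w_disc=1.0, w_smooth=0.1, w_reg=0.01):
    # Weighted sum of the three terms; the weights are illustrative only.
    return (w_disc * realism_loss(disc_logit)
            + w_smooth * smoothness_loss(flow)
            + w_reg * embedding_reg(tokens, tokens_init))


def sample_frame_pair(num_frames, rng):
    # Draw two distinct frame indices and return them in temporal order.
    i, j = sorted(rng.sample(range(num_frames), 2))
    return i, j
```

In an actual sampler, the gradient of such a loss with respect to the learnable token embeddings would be computed by automatic differentiation through the optical-flow estimator and the discriminator; that machinery is omitted here.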

Paper Structure

This paper contains 34 sections, 13 equations, 12 figures, 7 tables, 2 algorithms.

Figures (12)

  • Figure 1: MotionPrompt enhances temporal consistency and motion smoothness in text-to-video diffusion models by combining optical flow guidance with prompt optimization. It can be combined with a range of text-to-video diffusion models to produce visually coherent video sequences that closely align with intended motion while preserving content fidelity.
  • Figure 2: Overall pipeline of MotionPrompt. MotionPrompt enhances temporal consistency in text-to-video diffusion models by combining prompt optimization with an optical flow-based discriminator. Leveraging gradients from a subset of frames and aligning optical flow with real-world motion patterns, MotionPrompt efficiently generates videos with smooth, realistic motion and strong contextual coherence.
  • Figure 3: Qualitative comparison against three baselines. Additional results are provided in the supplementary material.
  • Figure 4: Cosine similarity between learnable and initial token embeddings. The cosine similarity decreases over time $t$, with more variation in embeddings observed for videos that initially exhibit lower subject consistency.
  • Figure 5: Comparison of video results generated by the vanilla DynamiCrafter model and our method.
  • ...and 7 more figures
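
Figure 4 tracks the cosine similarity between the learnable and the initial token embeddings. A minimal sketch of that measurement is given below, assuming each embedding is a flat list of floats; the function name is hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors.
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value near 1 means the optimized tokens have stayed close in direction to their initialization; lower values indicate that the optimization moved them further.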