Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo

TL;DR

This work addresses the burden of crafting prompts for text-to-video diffusion by introducing Prompt-A-Video, an LLM-driven, two-stage framework that learns video-centric prompts via a multi-dimensional reward system. It combines reward-guided evolution, supervised fine-tuning with LoRA, and direct preference optimization to align prompts with model preferences across diverse video-generation backbones. The approach yields consistent improvements in video quality, temporal coherence, and alignment on both in-domain and out-of-domain benchmarks, and extensions to text-to-image tasks demonstrate generalizability. The proposed automatic data pipeline and model-aware prompt refinement significantly reduce user effort while enhancing generation fidelity and coherence.

Abstract

Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining the quality of output videos. However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model. Our approach involves a meticulously crafted two-stage optimization and alignment system. Initially, we conduct a reward-guided prompt evolution pipeline to automatically create an optimal prompt pool and leverage it for supervised fine-tuning (SFT) of the LLM. Then multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.
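
The second stage described in the abstract, turning multi-dimensional rewards into pairwise data for DPO, can be illustrated with a small sketch. This is not the authors' implementation: the reward dimension names, the plain-sum aggregation, the `margin` threshold, and the function `build_preference_pairs` are all assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical reward record: one score per dimension for a candidate prompt.
Rewards = Dict[str, float]


def total_reward(rewards: Rewards) -> float:
    """Aggregate multi-dimensional rewards with a plain sum (an assumption)."""
    return sum(rewards.values())


def build_preference_pairs(
    user_prompt: str,
    candidates: List[Tuple[str, Rewards]],
    margin: float = 0.1,
) -> List[Dict[str, str]]:
    """Turn scored rewrites of one user prompt into chosen/rejected pairs."""
    pairs = []
    for (prompt_a, rewards_a), (prompt_b, rewards_b) in combinations(candidates, 2):
        gap = total_reward(rewards_a) - total_reward(rewards_b)
        if abs(gap) < margin:
            continue  # near-ties carry little preference signal, so skip them
        chosen, rejected = (prompt_a, prompt_b) if gap > 0 else (prompt_b, prompt_a)
        pairs.append({"prompt": user_prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Each resulting record has the prompt/chosen/rejected shape that preference-optimization trainers commonly expect. In practice the rewards would come from scoring videos generated with each candidate prompt, which this sketch leaves out.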

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Illustration of the discrepancies between current prompt adaptation frameworks and our proposed LLM-driven video prompt optimization system, namely Prompt-A-Video. Current prompt adaptation frameworks predominantly focus on prompt systems tailored for images, collect intrinsic modifier templates, or rely on the in-context capabilities of LLMs. However, in the realm of text-to-video, these methods encounter challenges stemming from Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware. In this paper, we introduce Prompt-A-Video, a two-stage optimization and alignment system based on AI feedback, which aims to provide Video-Centric, Labor-Free and Preference-Aligned prompts.
  • Figure 2: The Pipeline of Reward-guided Prompt Evolution, which employs iterative reward-feedback loops to obtain superior prompts through three processes: evaluation, selection, and evolution. This evolutionary pipeline serves dual purposes: it not only optimizes video generation quality through iterative prompt refinement, but also creates training data pairs of original and refined prompts.
  • Figure 3: Win rate of Prompt-A-Video versus original prompts and prompts refined by GPT-4o or GLM-4 in human evaluation. The user study is conducted on 100 prompts from the VBench test set.
  • Figure 4: Videos generated using CogVideoX and Open-Sora 1.2 with user prompts and Prompt-A-Video.
  • Figure 5: Video Metrics for evolutionary prompts generated in each iteration. VQ: visual quality, TC: temporal consistency, AES: aesthetic predictor, MPS: Multi-dimensional Human Preference.
  • ...and 1 more figures
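
The evaluation, selection, and evolution loop named in the Figure 2 caption can be sketched as below. This is a toy illustration under stated assumptions, not the paper's pipeline: `evolve_prompts`, `toy_score`, `toy_rewrite`, and `MODIFIERS` are hypothetical names. In a real system, `score` would generate a video and apply reward models, and `rewrite` would call an LLM.

```python
import random
from typing import Callable, List, Tuple


def evolve_prompts(
    seed_prompt: str,
    score: Callable[[str], float],
    rewrite: Callable[[List[str], random.Random], str],
    iterations: int = 3,
    pool_size: int = 4,
    keep_top: int = 2,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Iteratively refine a prompt; returns the top prompts with their scores."""
    rng = random.Random(seed)
    pool = [seed_prompt]
    best: List[Tuple[str, float]] = []
    for _ in range(iterations):
        # Evolution: propose new candidates conditioned on the current pool.
        while len(pool) < pool_size:
            pool.append(rewrite(pool, rng))
        # Evaluation: score every candidate (stand-in for video reward models).
        scored = sorted(((p, score(p)) for p in pool), key=lambda item: item[1], reverse=True)
        # Selection: keep only the highest-scoring prompts for the next round.
        best = scored[:keep_top]
        pool = [p for p, _ in best]
    return best


# Toy stand-ins so the sketch runs without any model (assumptions, not real rewards).
MODIFIERS = ["slow motion", "golden hour lighting", "steady camera", "shallow depth of field"]


def toy_score(prompt: str) -> float:
    """Count distinct comma-separated phrases as a placeholder reward."""
    return float(len(set(prompt.split(", "))))


def toy_rewrite(pool: List[str], rng: random.Random) -> str:
    """Append a random modifier to a random parent prompt."""
    return f"{rng.choice(pool)}, {rng.choice(MODIFIERS)}"
```

Calling `evolve_prompts("a dog runs on a beach", toy_score, toy_rewrite)` returns the surviving prompts in descending score order. Because the best candidates are carried over between rounds, the top score never decreases across iterations when the score function is deterministic, and the (original, refined) prompts can be logged as training pairs.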