Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling

Zillur Rahman, Alex Sheng, Cristian Meo

TL;DR

3R is a novel RAG-based prompt optimization framework that uses current state-of-the-art T2V diffusion models and vision-language models to enable more accurate, efficient, and contextually aligned text-to-video generation.

Abstract

While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, which makes prompt design critical to generation quality. Current methods for improving video output often fall short: they either depend on complex post-editing models, which risk introducing artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG-based prompt optimization framework. 3R builds on current state-of-the-art T2V diffusion models and vision-language models, and it can be used with any T2V model without any model training. The framework leverages three key strategies: RAG-based modifier extraction for enriched contextual grounding, diffusion-based Preference Optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual content. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.

Paper Structure

This paper contains 22 sections, 6 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the 3R pipeline. A short user prompt $I$ is used to retrieve a few relevant subject, scene, and action modifiers from a relation database $\mathcal{D}$. ${M}_{LLM}$ then iteratively merges those modifiers into the original user prompt to produce a detailed prompt $P_m$; ${R}_{LLM}$ checks $P_m$ for information that contradicts or is missing from the original prompt $I$ and generates $N$ refined prompts. The refined prompts are fed to a T2V base model $\mathcal{G}$ to generate an initial video for each prompt. Next, a video selection model selects the best candidate based on a question-answering test, and a temporal interpolation network enhances the temporal consistency of the final video.
  • Figure 2: Qualitative comparison of Lavie, IPO, and 3R in two common video generation failure modes. The left side shows prompts and video frames that illustrate challenges in semantic alignment, such as a mushroom growing out of a human head or a zoom-in. The right side shows prompts and video frames that illustrate challenges in handling fictional references, such as Darth Vader or a Pikachu Jedi.
  • Figure 3: Qualitative comparison of approaches in the common failure mode of generating videos containing text. We compare the first frame of the videos generated by Lavie (left), IPO (middle), and 3R (right) for prompts provided by the EvalCrafter benchmark. All three approaches show strong limitations in generating correct text, but 3R produces qualitatively more legible text: the text intended by the prompt ("keep off the grass" or "keep off") can still be partially inferred despite typos. The prompts and respective video frames show how our approach can address prompts requiring multiple semantic conditions while producing less distorted outputs.
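
The control flow described in the Figure 1 caption (retrieve modifiers, merge, refine into $N$ prompts, generate one video per prompt, select the best, interpolate) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: every function name here is hypothetical, the keyword lookup stands in for retrieval from $\mathcal{D}$, the string operations stand in for ${M}_{LLM}$ and ${R}_{LLM}$, and the generator, scorer, and interpolator are passed in as callables because the source does not specify their interfaces.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Video = List[float]  # toy stand-in: one number per frame


def retrieve_modifiers(prompt: str, database: Dict[str, List[str]], k: int = 3) -> List[str]:
    """Toy retrieval: collect modifiers whose key word appears in the prompt."""
    words = set(prompt.lower().split())
    hits: List[str] = []
    for key, modifiers in database.items():
        if key in words:
            hits.extend(modifiers)
    return hits[:k]


def merge_modifiers(prompt: str, modifiers: Sequence[str]) -> str:
    """Stand-in for M_LLM: fold modifiers into the prompt one at a time."""
    detailed = prompt
    for modifier in modifiers:
        detailed = f"{detailed}, {modifier}"
    return detailed


def refine(original: str, detailed: str, n: int) -> List[str]:
    """Stand-in for R_LLM: restore the original prompt if lost, emit n variants."""
    if original not in detailed:
        detailed = f"{original}, {detailed}"
    return [f"{detailed} (variant {i})" for i in range(n)]


def interpolate_midpoints(frames: Sequence[float]) -> Video:
    """Toy temporal interpolation: insert the midpoint between adjacent frames."""
    if not frames:
        return []
    out: Video = []
    for a, b in zip(frames, frames[1:]):
        out.extend([a, (a + b) / 2])
    out.append(frames[-1])
    return out


def run_3r(
    prompt: str,
    database: Dict[str, List[str]],
    generate: Callable[[str], Video],      # assumed interface for the T2V model G
    score: Callable[[Video, str], float],  # assumed interface for the QA-based selector
    interpolate: Callable[[Video], Video] = interpolate_midpoints,
    n: int = 4,
) -> Tuple[str, Video]:
    """Best-of-n selection over refined prompts, then temporal interpolation."""
    modifiers = retrieve_modifiers(prompt, database)
    detailed = merge_modifiers(prompt, modifiers)
    candidates = [(p, generate(p)) for p in refine(prompt, detailed, n)]
    # Score every candidate video against the ORIGINAL user prompt.
    best_prompt, best_video = max(candidates, key=lambda c: score(c[1], prompt))
    return best_prompt, interpolate(best_video)
```

Passing the generator and scorer as callables keeps the sketch model-agnostic, which mirrors the abstract's claim that the framework can wrap any T2V model without training it.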