
POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation

Shijie Ma, Huayi Xu, Mengjian Li, Weidong Geng, Yaxiong Wang, Meng Wang

TL;DR

POS targets instability in diffusion-based text-to-video generation by addressing two input prompts: the noise and the text. It introduces an optimal noise approximator that either retrieves a neighbor video and inverts it to approach the best matching noise, or trains a Noise Prediction Network to predict this noise directly, with a Gaussian mixture to preserve diversity. It also presents a semantic-preserving rewriter that combines reference-guided rewriting with a hybrid denoising strategy to enrich prompts without straying from the original semantics. Across multiple backbones and standard benchmarks, POS yields clear improvements in quality and semantic alignment, validating a model-agnostic, optimization-free approach that can be readily integrated into existing diffusion-based pipelines.

Abstract

This paper aims to enhance diffusion-based text-to-video generation by improving its two input prompts: the noise and the text. To this end, we propose POS, a training-free Prompt Optimization Suite to boost text-to-video models. POS is motivated by two observations. (1) Video generation is unstable with respect to noise. Given the same text, different noises lead to videos that differ significantly in both frame quality and temporal consistency. This observation implies that there exists an optimal noise matched to each textual input. To approach this optimal noise, we propose an optimal noise approximator. Specifically, the approximator first searches for a video that closely relates to the text prompt and then inverts it into the noise space to serve as an improved noise prompt for the textual input. (2) Improving the text prompt via LLMs often causes semantic deviation. Many existing text-to-vision works use LLMs to improve text prompts and thereby enhance generation. However, these methods often neglect the semantic alignment between the original text and the rewritten one. To address this issue, we design a semantic-preserving rewriter that imposes constraints in both the rewriting and denoising phases to preserve semantic consistency. Extensive experiments on popular benchmarks show that POS improves text-to-video models by a clear margin. The code will be open-sourced.
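The retrieve-then-invert idea behind the optimal noise approximator can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: retrieval is done by cosine similarity over precomputed text embeddings, inversion uses the standard deterministic DDIM inversion update, and the names `retrieve_neighbor`, `ddim_invert`, `eps_model`, and `alpha_bar` are hypothetical. The noise predictor is a placeholder callable that stands in for a trained text-to-video denoiser.

```python
import numpy as np


def retrieve_neighbor(query_emb, bank_embs):
    """Return the index of the bank entry most similar to the query.

    Assumption: similarity is cosine similarity between text embeddings.
    """
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    return int(np.argmax(b @ q))


def ddim_invert(x0, eps_model, alpha_bar):
    """Deterministic DDIM inversion: push a clean sample towards noise.

    x0        : clean video (or latent) as an array of any shape
    eps_model : callable (x, t) -> predicted noise, same shape as x
                (placeholder for a trained denoiser)
    alpha_bar : cumulative noise schedule, decreasing from 1 towards 0
    """
    x = x0
    for t in range(len(alpha_bar) - 1):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, t)
        # Estimate the clean sample implied by the current state.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Re-noise that estimate to the next (noisier) level.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x


def approximate_noise(query_emb, bank_embs, bank_videos, eps_model, alpha_bar):
    """Retrieve the nearest neighbor video and invert it into noise space."""
    idx = retrieve_neighbor(query_emb, bank_embs)
    return ddim_invert(bank_videos[idx], eps_model, alpha_bar)
```

In this sketch the returned array would be used as the initial noise in place of a purely random sample. With a real model, `eps_model` would also be conditioned on the text prompt, and the video would typically be inverted in the model's latent space rather than in pixel space.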

Paper Structure (14 sections, 8 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Different noises can yield significantly different videos in terms of quality. Based on this observation, we posit that there exists a potential optimal noise (orange circle): randomly sampled noise close to the optimal noise can synthesize high-quality results, while noise far from it leads to poor quality. Both videos are produced by ModelScope wang2023modelscope with the same prompt "A dog wearing a Superhero outfit with red cape flying through the sky".
  • Figure 2: Motivation for the optimal noise approximator. The trained denoising and inversion functions establish a bidirectional mapping between the video space and the noise space. Treating the inversion of the ground-truth video ("GT noise") as the optimal noise, our objective is to approximate this optimal noise by inverting video neighbors. We observe that similar videos converge to a confined region within the noise space, which forms the theoretical basis for our optimal noise approximator.
  • Figure 3: Illustration of our POS. Given a trained text-to-video model, POS enhances it by improving the two types of prompts: the noise and the text. The optimal noise approximator aims to approximate the optimal noise for the text prompt, while the semantic-preserving rewriter, formed by reference-guided rewriting and denoising with hybrid semantics, improves the text prompt by providing more details without deviating from the original semantics.
  • Figure 4: Architecture of Noise Prediction Network. During training, we invert real videos into the noise space, pairing them with their corresponding text descriptions to form training pairs. At inference time, the text prompt is directly input into the trained network to yield the optimal noise prediction.
  • Figure 5: Qualitative results. $\text{POS}_{\text{ModelScope}}$ denotes ModelScope equipped with POS. Subfigures (a) and (b) show the results with SCVideo and ModelScope as backbones. Each group shares the same random noise for a fair comparison.
  • ...and 1 more figures