RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation

Yuefan Cao, Chengyue Gong, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

TL;DR

This work tackles the bottleneck in text-to-video generation where a single text embedding fails to capture complex, mixed features. It introduces an optimal interpolation embedding finder that uses a perpendicular-foot anchor and cosine similarity to identify the best intermediate embedding $E_{\text{opt}}$ from two or more prompts, enabling the video model $f_\theta(E_{\text{opt}}, z)$ to generate videos that blend prescribed features. The authors prove, theoretically, that word embedding spaces are insufficient to represent all possible videos, motivating embedding optimization; they also demonstrate empirically on CogVideoX-2B that their method yields higher subject consistency and better mixing of features, albeit with trade-offs in aesthetic realism. Overall, the paper provides both a principled framework and practical algorithm for enriching the prompt space via embedding interpolation, highlighting the central role of embedding design in expanding the capabilities of text-to-video generation.

Abstract

Text-to-video generation models have made impressive progress, but they still struggle with generating videos with complex features. This limitation often arises from the inability of the text encoder to produce accurate embeddings, which hinders the video generation model. In this work, we propose a novel approach to overcome this challenge by selecting the optimal text embedding through interpolation in the embedding space. We demonstrate that this method enables the video generation model to produce the desired videos. Additionally, we introduce a simple algorithm using perpendicular foot embeddings and cosine similarity to identify the optimal interpolation embedding. Our findings highlight the importance of accurate text embeddings and offer a pathway for improving text-to-video generation performance.
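
The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes embeddings are flattened to plain vectors, that the perpendicular foot is the orthogonal projection of a guidance embedding `e_c` onto the line through `e_a` and `e_b`, and that the chosen candidate is the interpolation step with the highest cosine similarity to that foot. All function names and the step indexing (`i = 1..steps`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either is the zero vector)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def perpendicular_foot(
    e_a: Sequence[float], e_b: Sequence[float], e_c: Sequence[float]
) -> List[float]:
    """Orthogonal projection of e_c onto the line through e_a and e_b."""
    direction = [b - a for a, b in zip(e_a, e_b)]
    offset = [c - a for a, c in zip(e_a, e_c)]
    denom = dot(direction, direction)
    if denom == 0.0:  # e_a and e_b coincide, so the line degenerates to a point
        return list(e_a)
    t = dot(offset, direction) / denom
    return [a + t * d for a, d in zip(e_a, direction)]


def find_optimal_interpolation(
    e_a: Sequence[float], e_b: Sequence[float], e_c: Sequence[float], steps: int
) -> Tuple[int, List[float]]:
    """Return (step index, embedding) of the candidate closest in angle to the foot.

    Assumed candidate grid: E_i = (1 - i/steps) * e_a + (i/steps) * e_b, i = 1..steps.
    """
    foot = perpendicular_foot(e_a, e_b, e_c)
    best_index, best_embedding, best_score = 0, [], -math.inf
    for i in range(1, steps + 1):
        t = i / steps
        candidate = [(1 - t) * a + t * b for a, b in zip(e_a, e_b)]
        score = cosine_similarity(candidate, foot)
        if score > best_score:
            best_index, best_embedding, best_score = i, candidate, score
    return best_index, best_embedding


# Toy 2-D example: the guidance vector sits symmetrically between the two prompts.
best_index, best_embedding = find_optimal_interpolation(
    [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], steps=4
)
print(best_index, best_embedding)  # prints: 2 [0.5, 0.5]
```

In a real pipeline the selected embedding would be passed to the video model in place of an encoded prompt; how that is done depends on the model's interface and is outside this sketch.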

Paper Structure (25 sections, 7 theorems, 17 equations, 34 figures, 2 tables, 4 algorithms)

Key Result

Theorem 1.1

Let $n,d$ denote two integers, where $n$ denotes the maximum length of the sentence, and all videos are in $\mathbb{R}^d$ space. Let $V \in \mathbb{N}$ denote the vocabulary size. Let $\mathcal{U} = \{u_1, u_2, \cdots, u_V\}$ denote the word embedding space, where for $i \in [V]$, the word embedding

Figures (34)

  • Figure 1: Two kinds of text prompt mixture. Left: mixture of two prompts. We set two prompts, A and B, and linearly interpolate between their text embeddings, then use one of the interpolated embeddings to generate a video. To evaluate the interpolation, we set a third prompt, C, that describes the generated video, generate a video directly from C, and compare it with the interpolated video. Right: mixture of three prompts. We set two prompts, A and B, linearly interpolate between their text embeddings, and manually choose one interpolated embedding. We then linearly interpolate between the chosen embedding and the text embedding of prompt C, and use one of the resulting embeddings to generate a video. To evaluate the interpolation, we set a fourth prompt, D, that describes the generated video, generate a video directly from D, and compare it with the interpolated video.
  • Figure 2: Qualitative results of mixture of two features. Figure (a): Mixture of ["Tiger"] and ["Zebra"]; Figure (b): Mixture of ["Cat"] and ["Rabbit"]; Figure (c): Mixture of ["Sunflower"] and ["Snail"]. Our objective is to mix the features described in Prompt A and Prompt B with the guidance of Prompt C. We set the total number of interpolation steps to $30$. Using Algorithm \ref{alg:find_optimal_interpolation}, we identify the optimal embedding and generate the corresponding video. The video generated directly from Prompt C does not exhibit the desired mixed features from Prompts A and B.
  • Figure 3: Extending from two prompts mixture to three prompts mixture. Figure (a): Mixture of ["Strawberry"] and ["Blueberry"]. Figure (b): Mixture of ["Strawberry" + "Blueberry"] and ["Orange"]. We further apply Algorithm \ref{alg:find_optimal_interpolation} to the optimal embedding from (a) and the Prompt C embedding, with the guidance of Prompt D. We identify the $10$-th interpolation embedding as the optimal embedding of ["Strawberry" + "Blueberry"] and ["Orange"] and generate the corresponding video. The video generated directly from Prompt D does not exhibit the desired mixed features. Figure (c): Mixture of ["Tiger" + "Zebra"] and ["Giraffe"]. We present another example of a mixture of three prompts to demonstrate the effectiveness of our algorithm.
  • Figure 4: Mapping from Prompt Space to Video Space. This figure illustrates the mapping from a prompt space (with discrete prompts) to a video space (with continuous video embeddings) by a video generation model $f(x)$. Regardless of the specific form of the video generation model $f(x)$, there always exists a point in the video embedding space whose distance to all $f(x)$ is at least $\epsilon$.
  • Figure 5: Mixture of ["Tiger"] and ["Horse"]. Our objective is to mix the features described in Prompt A and Prompt B with the guidance of Prompt C. We set the total number of interpolation steps to $30$. Using Algorithm \ref{alg:find_optimal_interpolation}, we identify the $17$-th interpolation embedding as the optimal embedding and generate the corresponding video. The video generated directly from Prompt C does not exhibit the desired mixed features from Prompts A and B.
  • ...and 29 more figures
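
The two-prompt and three-prompt mixtures described in the Figure 1 caption can be sketched as chained linear interpolation. This is an illustrative toy under stated assumptions, not the paper's code: embeddings are short hand-written vectors rather than text-encoder outputs, the helper names `lerp` and `interpolation_path` are hypothetical, and the "manually chosen" intermediate embedding is simply picked by index.

```python
from typing import List, Sequence


def lerp(e_start: Sequence[float], e_end: Sequence[float], t: float) -> List[float]:
    """Linear interpolation: (1 - t) * e_start + t * e_end, elementwise."""
    return [(1 - t) * s + t * e for s, e in zip(e_start, e_end)]


def interpolation_path(
    e_start: Sequence[float], e_end: Sequence[float], steps: int
) -> List[List[float]]:
    """Embeddings at t = 1/steps, 2/steps, ..., 1 (assumed step convention)."""
    return [lerp(e_start, e_end, i / steps) for i in range(1, steps + 1)]


# Toy stand-ins for the text embeddings of prompts A, B, and C.
e_a = [1.0, 0.0, 0.0]
e_b = [0.0, 1.0, 0.0]
e_c = [0.0, 0.0, 1.0]

# Two-prompt mixture: interpolate A -> B and pick one step (here the 2nd of 4).
path_ab = interpolation_path(e_a, e_b, steps=4)
e_ab = path_ab[1]

# Three-prompt mixture: interpolate the chosen A/B embedding -> C and pick again.
path_abc = interpolation_path(e_ab, e_c, steps=4)
e_abc = path_abc[1]

print(e_ab)   # prints: [0.5, 0.5, 0.0]
print(e_abc)  # prints: [0.25, 0.25, 0.5]
```

Chaining the interpolation this way keeps each stage a two-endpoint problem, so the same selection procedure can in principle be reused at every stage.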

Theorems & Definitions (23)

  • Theorem 1.1: Word Embeddings being Insufficient to Represent for All Videos, informal version of Theorem 5.9
  • Definition 3.1: Finding Optimal Interpolation Embedding Problem
  • Definition 5.1: Linear Interpolation
  • Definition 5.2: Cosine Similarity Calculator
  • Definition 5.4: Attention Layer
  • Definition 5.5: Convolution Layer
  • Definition 5.6: Linear Projection
  • Definition 5.7: 3D Attention
  • Definition 5.8: Text-to-Video Generation Model
  • Theorem 5.9: Word Embeddings being Insufficient to Represent for All Videos, formal version of Theorem 1.1
  • ...and 13 more