RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation
Yuefan Cao, Chengyue Gong, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
TL;DR
This work tackles the bottleneck in text-to-video generation where a single text embedding fails to capture complex, mixed features. It introduces an optimal interpolation embedding finder that uses a perpendicular-foot anchor and cosine similarity to identify the best intermediate embedding $E_{\text{opt}}$ from two or more prompts, enabling the video model $f_\theta(E_{\text{opt}}, z)$ to generate videos that blend prescribed features. The authors prove, theoretically, that word embedding spaces are insufficient to represent all possible videos, motivating embedding optimization; they also demonstrate empirically on CogVideoX-2B that their method yields higher subject consistency and better mixing of features, albeit with trade-offs in aesthetic realism. Overall, the paper provides both a principled framework and practical algorithm for enriching the prompt space via embedding interpolation, highlighting the central role of embedding design in expanding the capabilities of text-to-video generation.
Abstract
Text-to-video generation models have made impressive progress, but they still struggle with generating videos with complex features. This limitation often arises from the inability of the text encoder to produce accurate embeddings, which hinders the video generation model. In this work, we propose a novel approach to overcome this challenge by selecting the optimal text embedding through interpolation in the embedding space. We demonstrate that this method enables the video generation model to produce the desired videos. Additionally, we introduce a simple algorithm using perpendicular foot embeddings and cosine similarity to identify the optimal interpolation embedding. Our findings highlight the importance of accurate text embeddings and offer a pathway for improving text-to-video generation performance.
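The perpendicular-foot and cosine-similarity selection mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that embeddings are flattened into plain vectors of equal length, that candidates come from linear interpolation between two prompt embeddings `e_a` and `e_b`, and that a third embedding `e_c` (for example, from a prompt describing the desired mixture) supplies the anchor. The function names, the `steps` parameter, and the grid search are hypothetical choices made for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def perpendicular_foot(e_a, e_b, e_c):
    """Orthogonal projection of e_c onto the line through e_a and e_b."""
    direction = [b - a for a, b in zip(e_a, e_b)]
    offset = [c - a for a, c in zip(e_a, e_c)]
    t = dot(offset, direction) / dot(direction, direction)
    return [a + t * d for a, d in zip(e_a, direction)]

def find_interpolation(e_a, e_b, e_c, steps=20):
    """Return (weight, embedding) of the linear interpolation between
    e_a and e_b that is most cosine-similar to the perpendicular foot.
    Assumption: a uniform grid of steps + 1 candidate weights."""
    foot = perpendicular_foot(e_a, e_b, e_c)
    best_score, best_lam, best_emb = None, None, None
    for i in range(steps + 1):
        lam = i / steps
        candidate = [(1 - lam) * a + lam * b for a, b in zip(e_a, e_b)]
        score = cosine_similarity(candidate, foot)
        if best_score is None or score > best_score:
            best_score, best_lam, best_emb = score, lam, candidate
    return best_lam, best_emb

# Toy 2-D example: the anchor sits symmetrically between the two prompts.
lam, e_opt = find_interpolation([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
```

In this toy case the perpendicular foot is the midpoint of the segment, so the selected weight is 0.5. In practice the selected vector would be reshaped to the encoder's output shape and passed to the video model in place of a single-prompt embedding, which is the role $E_{\text{opt}}$ plays in the summary above.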
