EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation

Diljeet Jagpal, Xi Chen, Vinay P. Namboodiri

TL;DR

EIDT-V presents a training-free, model-agnostic framework for text-to-video generation by exploiting intersections in diffusion trajectories and applying grid-based, text-guided prompt switching. It leverages two LLM modules to produce framewise prompts and detect inter-frame differences, complemented by a CLIP-based attention mask to schedule prompt switches regionally, balancing coherence and variance. The approach achieves competitive temporal coherence and visual fidelity across multiple diffusion backbones (SD1.5, SDXL, SD3), with extensive ablations and a human user study validating perceptual quality and user satisfaction. By operating entirely in latent space and avoiding architecture-level changes, EIDT-V offers a scalable, accessible path to high-quality video synthesis that adapts to evolving diffusion models. The work underscores the potential of combining diffusion trajectory theory, grid-based conditioning, and language-guided cues to advance training-free video generation at practical costs.

Abstract

Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image generation models, which limit their adaptability and scalability. In contrast to such methods, we provide a model-agnostic approach. We use intersections in diffusion trajectories, working only with the latent values. We could not obtain localized frame-wise coherence and diversity using only the intersection of trajectories. Thus, we instead use a grid-based approach. An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames. Based on these, we obtain a CLIP-based attention mask that controls the timing of switching the prompts for each grid cell. Earlier switching results in higher variance, while later switching results in more coherence. Therefore, our approach can ensure appropriate control between coherence and variance for the frames. Our approach results in state-of-the-art performance while being more flexible when working with diverse image-generation models. The empirical analysis using quantitative metrics and user studies confirms our model's superior temporal consistency, visual fidelity and user satisfaction, thus providing a novel way to obtain training-free, image-based text-to-video generation.
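The per-cell switching idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear mapping from attention score to switch step, the parameter names `earliest` and `latest`, and the function names are all assumptions made for illustration. It only shows how a higher attention score could translate into an earlier prompt switch for a grid cell, and how a resulting binary mask could blend two latents at a given denoising step.

```python
import numpy as np

def switch_steps(attention, num_steps, earliest=0.2, latest=0.8):
    """Map per-cell attention scores in [0, 1] to a switch step.

    Assumption (not from the paper): a linear mapping where high attention
    gives an early switch (more variance) and low attention gives a late
    switch (more coherence). Step 0 is the noisiest step.
    """
    attention = np.clip(attention, 0.0, 1.0)
    frac = latest - attention * (latest - earliest)
    return np.round(frac * (num_steps - 1)).astype(int)

def binary_mask(steps, step_index):
    """Return 1.0 where a cell has already switched to the new prompt."""
    return (step_index >= steps).astype(np.float32)

def blend(x_old, x_new, mask):
    """Per-cell composite of two latents; mask broadcasts over channels."""
    return mask * x_new + (1.0 - mask) * x_old

# Hypothetical 2x2 grid: only the top-left cell is expected to change.
attention = np.array([[1.0, 0.0],
                      [0.0, 0.0]])
steps = switch_steps(attention, num_steps=50)
mask = binary_mask(steps, step_index=20)

x_old = np.zeros((4, 2, 2), dtype=np.float32)  # latent on the original prompt
x_new = np.ones((4, 2, 2), dtype=np.float32)   # latent on the new prompt
x_t = blend(x_old, x_new, mask)
```

In this toy setup the top-left cell has switched to the new-prompt latent by step 20, while the remaining cells still follow the original prompt and switch only near the end of denoising.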

Paper Structure

This paper contains 53 sections, 8 equations, 31 figures, 5 tables.

Figures (31)

  • Figure 1: EIDT-V Pipeline for Frame-Based Video Generation. The pipeline consists of two primary modules: text and video. The text module converts the user’s input into framewise prompts and expected variations, which guide the video module in generating frames iteratively. The video module achieves controlled variance and coherence across frames by leveraging trajectory intersections. Integrating two LLM modules and the grid prompt switching enables a generic image diffusion model to synthesize coherent video sequences effectively.
  • Figure 2: Grid Prompt Switching with Text-Guided Attention. On the right, an auxiliary block processes the previous frame and difference text using a CLIP segmentation model to generate the STM. This is converted into a binary mask (Eq. \ref{eq:mask}) that selects, at timestep $t$, whether each grid cell follows the original or new prompt trajectory. In the main denoising process, the mask blends the latent representations $X_t^{(A)}$ and $X_t^{(B)}$ as per Eq. \ref{eq:composite_refined} to form $X_t$. While a coarse $3 \times 3$ grid is shown for clarity, in practice, higher-resolution masks (e.g., $128 \times 128$) are used.
  • Figure 3: Qualitative comparison of different video generation models across three prompts: (a) "A chameleon changing colors on a branch", (b) "A horse galloping across a field", and (c) "An astronaut floating in space waving". T2VZero produces coherent frames but does not fully capture the specifics of each prompt; for instance, the chameleon does not change colors, and the astronaut does not appear to be waving. DirecT2V struggles to generate coherent frames. Interestingly, both DirecT2V and FreeBloom, which are LLM-based models, capture the essence of "waving" and "space" but fail to fully integrate these concepts in each frame. They achieve strong semantic coherence but lack temporal coherence. Our model, however, demonstrates clear color changes in the chameleon, captures the horse's movement (notice the legs), and shows the astronaut's arm moving in a waving pattern while keeping the rest of the frame highly consistent.
  • Figure 4: Ablation Qualitative Results (see Table \ref{tab:ablation}). Each row displays four equally spaced frames from the generated GIF. The prompt is "A cup of coffee being poured with steam rising". The naive approach produces images linked only by theme. Applying OFP without GrPS offers minor improvements, while incorporating GrPS (CG with GrPS) notably increases coherence. Finally, combining OFP with GrPS yields the best performance.
  • Figure 6: Qualitative comparison of SD1.5-based video-generation models for the prompt: "A first-person view from atop a horse, its ears and mane visible, moving forward across a grassy field". A fixed seed was used across all models.
  • ...and 26 more figures