TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks

TL;DR

TI2V-Zero is a zero-shot approach to text-conditioned image-to-video (TI2V) generation that conditions a frozen pretrained text-to-video (T2V) diffusion model on a provided starting image. The method combines a repeat-and-slide strategy for frame-by-frame generation, DDPM-based inversion to initialize the noise for each new frame, and resampling to preserve visual details and temporal coherence, which enables autoregressive long-video generation without any training. Experiments on MUG, UCF101, and OPEN show performance competitive with or superior to open-domain TI2V baselines, and ablations confirm the importance of inversion and resampling. The approach depends on the quality of the base T2V model and has slower inference because each frame requires its own diffusion pass, which suggests future work on faster sampling and on post-processing to address artifacts.

Abstract

Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
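
The frame-by-frame, autoregressive loop described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `repeat_and_slide`, the queue length `K`, and the callback `synthesize_next` are assumptions made for the example. The callback is a placeholder for one full pass of the frozen T2V model (inversion, guided denoising, and resampling), which is not reproduced here.

```python
from collections import deque
from typing import Callable, List, Sequence, TypeVar

Latent = TypeVar("Latent")


def repeat_and_slide(
    z0: Latent,
    num_new_frames: int,
    K: int,
    synthesize_next: Callable[[Sequence[Latent]], Latent],
) -> List[Latent]:
    """Generate frame latents one at a time from a length-K conditioning queue.

    `synthesize_next` is a hypothetical stand-in for the frozen diffusion
    model: it receives the current queue and returns one new frame latent.
    """
    if K < 1:
        raise ValueError("K must be at least 1")

    # "Repeat": the queue starts as K copies of the encoded starting image.
    queue = deque([z0] * K, maxlen=K)
    video = [z0]

    for _ in range(num_new_frames):
        new_latent = synthesize_next(tuple(queue))
        video.append(new_latent)
        # "Slide": because maxlen=K, appending drops the oldest entry.
        queue.append(new_latent)

    return video
```

Because the queue has a fixed length, the conditioning context stays the same size no matter how many frames have been produced, which is what makes this kind of loop suitable for long videos.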

Paper Structure

This paper contains 13 sections, 7 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Examples of generated video frames using our proposed TI2V-Zero. The given first image $x^0$ is highlighted with the red box, and the text condition $y$ is shown under each row of the video. The remaining columns show the 6th, 11th, and 16th frames of the generated output videos. Each generated video has 16 frames with a resolution of $256\times256$.
  • Figure 2: Illustration of the process of applying TI2V-Zero to generate the new frame $\hat{x}^{i+1}$, given the starting image $x^0$ and text $y$. TI2V-Zero is built upon a frozen pretrained T2V diffusion model, including frame encoder $\mathcal{E}$, frame decoder $\mathcal{D}$, and the denoising U-Net $\epsilon_\theta$. At the beginning of generation ($i=0$), we encode $x^0$ as $z^0$ and repeat it $K$ times to form the queue $\mathbf{s}_0$. We then apply DDPM-based inversion to $\mathbf{s}_0$ to produce the initial Gaussian noise $\hat{\mathbf{z}}_T$. Subsequently, in each reverse denoising step using U-Net $\epsilon_\theta$, we keep replacing the first $K$ frames of $\hat{\mathbf{z}}_t$ with the noisy latent code $\mathbf{s}_t$ derived from $\mathbf{s}_0$. Resampling is also applied within each step to improve motion coherence. We finally decode the final frame of the clean latent code $\hat{\mathbf{z}}_0$ as the new synthesized frame $\hat{x}^{i+1}$. To compute the new $\mathbf{s}_0$ for the next iteration of generation ($i>0$), we perform a sliding operation by dequeuing ${s}^0_0$ and enqueuing $\hat{z}^K_0$ within $\mathbf{s}_0$.
  • Figure 3: Illustration of the motivation behind our framework. We explore the application of a replacing-based baseline approach (rows 2--4, labeled "Replacing") and our TI2V-Zero (rows 5--6, labeled "TI2V-Zero") in various video generation tasks. The given real frames for each task are highlighted by red boxes and the text input is shown under the block. The replacing-based approach is only effective at predicting a single frame when all the other frames in the video are provided, while TI2V-Zero generates temporally coherent videos for both the TI2V and video infilling tasks.
  • Figure 4: Qualitative ablation study comparing different sampling strategies for our TI2V-Zero on MUG. The first image $\hat{x}^0$ is highlighted with the red box and text $y$ is shown under the block. The 1st, 6th, 11th, and 16th frames of the videos are shown in each column. The terms Inversion, DDIM, and Resample denote the application of DDPM inversion, the steps using DDIM sampling, and the iteration number using resampling, respectively.
  • Figure 5: Qualitative comparison among different methods on multiple datasets for TI2V generation. Columns in each block display the 1st, 6th, 11th, and 16th frames of the output videos, respectively. There are 16 frames with a resolution of $256\times256$ for each video. The given image $x^0$ is highlighted with the red box and the text prompt $y$ is shown under each block.
  • ...and 1 more figure
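
The replacement step in the Figure 2 caption (overwriting the first $K$ frames of $\hat{\mathbf{z}}_t$ with a noisy version $\mathbf{s}_t$ of the queue $\mathbf{s}_0$) can be sketched as below. The sketch assumes that $\mathbf{s}_t$ is drawn with the standard DDPM forward process, $\mathbf{s}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{s}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, which the caption does not spell out. The function names and the use of plain lists of floats in place of latent tensors are also assumptions for illustration only.

```python
import math
import random
from typing import List, Sequence

Frame = List[float]  # a flattened latent frame; real latents would be tensors


def noise_to_step(
    s0: Sequence[Frame], alpha_bar_t: float, rng: random.Random
) -> List[Frame]:
    """Sample s_t from s_0, assuming the standard DDPM forward process."""
    if not 0.0 <= alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in [0, 1]")
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [
        [signal * value + sigma * rng.gauss(0.0, 1.0) for value in frame]
        for frame in s0
    ]


def replace_first_k(z_t: Sequence[Frame], s_t: Sequence[Frame]) -> List[Frame]:
    """Overwrite the first K = len(s_t) frames of z_t with s_t; keep the rest."""
    k = len(s_t)
    if k > len(z_t):
        raise ValueError("s_t has more frames than z_t")
    return [list(frame) for frame in s_t] + [list(frame) for frame in z_t[k:]]
```

In this sketch only the frames after the first $K$ are left for the denoiser to determine, which matches the caption's description of decoding the final frame of $\hat{\mathbf{z}}_0$ as the newly synthesized frame.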