Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning

Bolin Lai, Sangmin Lee, Xu Cao, Xiang Li, James M. Rehg

TL;DR

FlexTI2V is a training-free method that conditions text-to-video diffusion models on an arbitrary number of images at flexible positions. It inverts the condition images to noisy latent representations and injects their visual features through a random patch swapping mechanism with dynamic control. The approach achieves state-of-the-art performance among training-free TI2V methods on UCF-101 across image animation, rewinding, inpainting/outpainting, and interpolation, while also generalizing to transformer-based models such as Wan2.1 and delivering efficient inference. Ablation and qualitative analyses confirm that random patch swapping and dynamic control are both essential for balancing fidelity to the condition images with creative motion. The method inherits limitations from the base T2V models, including camera viewpoint transitions and watermark biases, which points to future work on explicit camera motion modeling and broader conditioning modalities.

Abstract

Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few pre-defined conditioning settings. To tackle these constraints, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose a training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to a noisy representation in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. Our method also generalizes to both UNet-based and transformer-based architectures.
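
To make the random patch swapping idea concrete, here is a minimal NumPy sketch of the core operation: copying a random subset of local patches from a noisy condition-image latent into a video-frame latent. It is an illustration under stated assumptions, not the authors' implementation. The function name `random_patch_swap`, the `(C, H, W)` latent layout, the patch size, and the way `ratio` is supplied are all hypothetical choices made for this example.

```python
import numpy as np

def random_patch_swap(frame, cond, ratio, patch=2, rng=None):
    """Copy a random subset of non-overlapping patches from `cond` into `frame`.

    frame, cond: latent arrays of shape (C, H, W), with H and W divisible
                 by `patch` (layout and patch size are assumptions).
    ratio:       fraction of patches to swap, in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = frame.shape
    gh, gw = h // patch, w // patch            # patch grid size
    n_swap = int(round(ratio * gh * gw))       # number of patches to replace

    # Pick patch indices without replacement and build a patch-level mask.
    chosen = rng.choice(gh * gw, size=n_swap, replace=False)
    mask = np.zeros(gh * gw, dtype=bool)
    mask[chosen] = True

    # Expand the patch-level mask to the latent's spatial resolution.
    mask = np.repeat(np.repeat(mask.reshape(gh, gw), patch, axis=0), patch, axis=1)

    # Swapped positions take the condition latent; the rest keep the frame.
    return np.where(mask[None], cond, frame)
```

In a full pipeline this operation would presumably be applied to each frame at each denoising step, using the condition latent inverted to the same noise level; that outer loop is omitted here because its details are not given in this summary.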

Paper Structure

This paper contains 29 sections, 5 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: We propose FlexTI2V, a novel training-free approach that can add flexible image conditioning to off-the-shelf text-to-video foundation models. We are able to frame an arbitrary number of images at arbitrary positions in the synthetic video with vivid motion and smooth transitions. In this figure, images with blue edges are condition images, and images with green edges are generated video frames.
  • Figure 2: Comparison with classic TI2V tasks. Our task requires video generation conditioned on any number of images at any positions in the output video, which unifies existing classic TI2V tasks. The images with blue and pink edges are condition images, and images with green edges are generated video frames.
  • Figure 3: Overview of the proposed FlexTI2V approach. We invert the condition image embedding to a noisy representation $\tilde{\bm{x}}_t$ at each step. The final noise $\tilde{\bm{x}}_T$ is reused as initialization for video synthesis. At step $t$, we directly replace the video frames with images at the desired positions. Then, for each video frame, we randomly swap a portion of patches with bounded condition images based on the relative distance between the frame and images. Though we show a special case of using two condition images in this figure, our method naturally extends to any number of images at any positions. Note that all operations of our method occur in the latent space. We visualize RGB images and frames on the latent representations simply for intuitive understanding.
  • Figure 4: Comparison with prior methods. The images with blue and pink edges are condition images for each setting, and images with green edges are generated video frames. Our approach synthesizes videos with higher frame consistency and fidelity to condition images than other baseline models in various settings.
  • Figure 5: Videos generated by our method applied to Wan2.1-T2V. The images with blue edges are condition images and images with green edges are generated video frames. Please refer to the supplementary for corresponding videos and extra demos.
  • ...and 5 more figures
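
The Figure 3 caption says that the portion of swapped patches depends on the relative distance between a frame and the condition images around it. The sketch below shows one way such a distance-based schedule could look. The linear decay, the `max_ratio` value, and the function names `bounding_conditions` and `swap_ratio` are assumptions for illustration only; the actual schedule used by FlexTI2V is not specified in this summary.

```python
from bisect import bisect_left

def bounding_conditions(frame_idx, cond_positions):
    """Return the nearest condition positions at-or-before and at-or-after a frame.

    Either side is None when no condition image exists in that direction.
    """
    pos = sorted(cond_positions)
    i = bisect_left(pos, frame_idx)
    right = pos[i] if i < len(pos) else None
    if right == frame_idx:
        left = right                      # the frame itself is a condition position
    else:
        left = pos[i - 1] if i > 0 else None
    return left, right

def swap_ratio(frame_idx, cond_idx, num_frames, max_ratio=0.5):
    """Hypothetical schedule: the swap ratio decays linearly with distance.

    Frames close to a condition image borrow more of its patches (higher
    fidelity); distant frames borrow fewer (more freedom for motion).
    """
    d = abs(frame_idx - cond_idx) / max(num_frames - 1, 1)  # normalized distance
    return max_ratio * (1.0 - d)
```

For example, with condition images at positions 0 and 7 of an 8-frame video, `bounding_conditions(3, [0, 7])` yields both neighbors, and frame 3 receives a larger ratio from position 0 than frame 5 does.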