Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction

Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel

TL;DR

This work tackles text-video prediction by addressing the need for temporal continuity when extending videos conditioned on both initial frames and natural language. It introduces Frame-wise Conditioning Adaptation (FCA), an adaptation-based fine-tuning strategy that adds parallel FCA attention blocks into a frozen diffusion transformer (DiT) base and injects initial-frame latents along with frame-wise text conditioning into cross-attention pathways. The method demonstrates state-of-the-art performance on TVP benchmarks, achieving substantial reductions in FVD and competitive results against image-to-video baselines, while providing detailed ablations and training insights. The approach offers a practical, scalable path for leveraging large pre-trained T2V models for TVP tasks with improved temporal coherence and text alignment, accompanied by open-source code for reproducibility.

Abstract

Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. Our code is open-source at https://github.com/Cuberick-Orion/FCA .

Paper Structure

This paper contains 57 sections, 4 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Overview. Text-video prediction (TVP) models generate subsequent video frames on the basis of the initial frame(s) and a natural language description of the required motion. Our method leverages a pre-trained text-to-video (T2V) diffusion transformer (DiT) model, while introducing an effective adaptation method (FCA) for fine-tuning. Our method integrates the initial frames, as well as frame-wise text conditions to aid the generation.
  • Figure 2: Details of the frame-wise text conditioning module inspired by Q-Former [li2023blip], and its integration with FCA. We only show one DiT layer here for clarity, but note that we separately apply a frame-wise text conditioning module to every layer. In total, we initialize and train $D$ such modules for the $D$ layers of the DiT. This figure complements Figure 3 (FCA module, bottom-left block). ViT stands for the Vision Transformer [dosovitskiyimage].
  • Figure 3: An illustration of Frame-wise Conditioning Adaptation (FCA) on a diffusion transformer (DiT). Right: an arbitrary pre-trained DiT block [peebles2023scalable_dit]. Left: the proposed FCA module, introduced in Section \ref{sec:method-adapter}, which incorporates the frame-wise text conditioning discussed in Section \ref{sec:method-frame-wise-text}. We only show one DiT layer here for clarity, but note that we separately apply the same modules to every layer. $\mathbf{y}$ and $\mathbf{x}_t$ denote the text tokens and the noisy latent for a DiT, respectively; $\left[\cdot; \cdot\right]$ denotes concatenation. $\mathbf{x}_\text{init}$ represents the latent of the initial frames introduced in Section \ref{sec:method-adapter}, and $\mathbf{y}_\text{frames}$ denotes the frame-wise text conditioning embeddings. A frame-wise attention mask is applied within the multi-head cross-attention module; it is omitted from this figure (see Section \ref{sec:method-frame-wise-text}). $\alpha,\beta,\gamma$ are the learned parameters of the adaptive LayerNorm (adaLN) [peebles2023scalable_dit]. The trainable module is marked with a fire symbol. See Figure 2 for details of the frame-wise text conditioning module.
  • Figure 4: Qualitative examples of video generation. We compare our method with Seer [gu_seer] and also show the ground truth (GT). For inference with Seer, we adhere to its configuration of 12 frames, while our method generates 16 frames (split into two rows). For illustration purposes, we replace the conditioning frames (the first two) with the ground truth. See Section \ref{sec:supp-quali} for qualitative examples on other datasets. Note that the aspect ratios of the figures are slightly adjusted for demonstration purposes.
  • Figure 5: Comparison with LoRA [hu2022lora] fine-tuning and a pre-trained I2V model (CogVideoX1.5-5B-I2V). We replace the conditioning initial frames (the first two) with the ground truth. Each sample has 16 frames, split into two rows. Text prompt: "Pouring something out of something".
  • ...and 10 more figures