APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency

Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng

TL;DR

This work tackles frame-to-frame inconsistency in diffusion-based video generation by introducing APLA, a lightweight perturbation mechanism that captures intrinsic input information via a Video Generation Transformer (VGT). The VGT operates as a decoder-only Transformer, with two variants (VGT-Pure and VGT-Hyper) that generate tiny latent perturbations to refine temporal predictions without substantially altering content. A Hyper-Loss combines $L_{MSE}$, $L_{L1}$, and perceptual loss to better preserve details, and adversarial training with a 1×1 discriminator enforces temporal coherence across frames. Evaluations demonstrate improved frame consistency (FCI) and semantic alignment (CLIP) in both reconstruction and text-to-video tasks, achieving state-of-the-art results on representative video benchmarks. Overall, APLA offers a practical pathway to more stable, high-quality video synthesis by leveraging intrinsic information and adversarial regularization on top of pre-trained diffusion models.
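As a rough illustration of how the three Hyper-Loss terms could be combined, the sketch below forms a weighted sum of $L_{MSE}$, $L_{L1}$, and a perceptual term. The weights `w_mse`, `w_l1`, and `w_perc`, the `features` callable (a stand-in for a pre-trained feature extractor), and the use of MSE in feature space are all assumptions made for illustration; this summary does not specify them.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def mse_loss(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute error between two equally sized vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def hyper_loss(
    pred: Vector,
    target: Vector,
    features: Callable[[Vector], Vector],  # hypothetical feature extractor
    w_mse: float = 1.0,   # assumed weight, not taken from the paper
    w_l1: float = 1.0,    # assumed weight, not taken from the paper
    w_perc: float = 1.0,  # assumed weight, not taken from the paper
) -> float:
    """Weighted sum of MSE, L1, and a perceptual (feature-space) term."""
    # Assumption: the perceptual term is an MSE between extracted features.
    perceptual = mse_loss(features(pred), features(target))
    return (
        w_mse * mse_loss(pred, target)
        + w_l1 * l1_loss(pred, target)
        + w_perc * perceptual
    )


# Toy usage: the "feature extractor" here just sums the vector.
toy_features = lambda v: [sum(v)]
loss = hyper_loss([1.0, 2.0], [1.0, 4.0], toy_features)
```

In practice the feature extractor would be a fixed pre-trained network, and the inputs would be tensors rather than Python lists.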

Abstract

Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate the Gaussian noise distribution using predicted noise, without fully accounting for the impact of inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach requires only a single video as input and builds upon pre-trained Stable Diffusion networks. Notably, we introduce an additional compact network, known as the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the inherent information contained within the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos, both qualitatively and quantitatively.

Paper Structure (15 sections, 20 equations, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: Comparison between Tune-A-Video and the proposed APLA using the same prompt, "A man is skiing". (a) With Tune-A-Video, the snowboard splits into multiple parts across the frames. (b) With APLA, a single snowboard is preserved in all frames.
  • Figure 2: Visual demonstrations of APLA using different prompts.
  • Figure 3: The training process. VGT extracts intrinsic information from latent variables at various noise time steps, including the clean, noise-free latent variable, namely the original latent variable $z$. Because VGT is not pre-trained, its output is tiny and changes the overall output only slightly, which helps improve consistency across frames without changing the content much. The discriminator receives the predicted noise and the noise residuals for the corresponding time steps in the diffusion stage.
  • Figure 4: An illustration of VGT-Pure and VGT-Hyper. The left side shows the transformer decoder structure, which applies a mask operation to the self-attention mechanism. The right side shows the two versions of VGT. The Temporal Transformer Decoder receives only the class (cls) token of the Spatial Transformer Decoder's output sequence. In VGT-Pure, the remaining tokens of the Spatial Transformer Decoder output are multiplied with the tokens of the Temporal Transformer Decoder output, excluding the cls token; in VGT-Hyper, the whole output of the Temporal Transformer Decoder is passed directly to the Transposed Convolution Block.
  • Figure 5: (a) Comparison of different versions of VGT. "EN" denotes the use of a transformer encoder instead of a decoder, meaning that the mask operation is not included. VGT-Hyper performs best, while its encoder version (VGT-Hyper-EN) performs worst. For VGT-Pure, the encoder and decoder versions perform similarly, and both fall between VGT-Hyper and VGT-Hyper-EN. (b) The ratio of the VGT output to the U-Net output in the denoising step. The norm of the VGT output is very small compared with that of the U-Net output, indicating that VGT leaves the original output largely unchanged while improving consistency across frames.
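
The captions for Figures 4 and 5 distinguish the decoder variants (with a mask on self-attention) from the "EN" encoder variants (without it). The following minimal sketch shows what such a mask does to attention weights. It assumes a standard causal (lower-triangular) mask, which the captions do not confirm, and the helper names `causal_mask` and `masked_softmax` are hypothetical.

```python
import math
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """Assumed mask: token i may attend only to tokens j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n: int) -> List[List[bool]]:
    """No masking, as in the encoder ("EN") variants."""
    return [[True] * n for _ in range(n)]


def masked_softmax(
    scores: List[List[float]], mask: List[List[bool]]
) -> List[List[float]]:
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for row, mask_row in zip(scores, mask):
        # Subtract the largest visible score for numerical stability.
        peak = max(s for s, m in zip(row, mask_row) if m)
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy usage: uniform scores over three tokens.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
decoder_weights = masked_softmax(scores, causal_mask(3))
encoder_weights = masked_softmax(scores, full_mask(3))
```

With uniform scores, the masked version lets the first token attend only to itself, whereas the unmasked version spreads attention evenly over all three tokens.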