APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency
Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng
TL;DR
This work tackles frame-to-frame inconsistency in diffusion-based video generation by introducing APLA, a lightweight perturbation mechanism that captures information intrinsic to the input via a Video Generation Transformer (VGT). The VGT is a decoder-only Transformer with two variants (VGT-Pure and VGT-Hyper) that generate small latent perturbations to refine temporal predictions without substantially altering content. A Hyper-Loss combines $L_{MSE}$, $L_{L1}$, and a perceptual loss to better preserve details, and adversarial training with a 1×1 discriminator enforces temporal coherence across frames. Evaluations show improved frame consistency (FCI) and semantic alignment (CLIP) in both reconstruction and text-to-video tasks, with state-of-the-art results reported on representative video benchmarks. Overall, APLA offers a practical pathway to more stable, high-quality video synthesis by leveraging intrinsic information and adversarial regularization on top of pre-trained diffusion models.
Abstract
Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate the Gaussian noise distribution using predicted noise, without fully accounting for the impact of information inherent in the input itself. Additionally, these models emphasize the difference between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach requires only a single video as input and builds upon pre-trained Stable Diffusion networks. Notably, we introduce an additional compact network, the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the information inherent in the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos, both qualitatively and quantitatively.
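
To make the Hyper-Loss mentioned in the TL;DR concrete, the following is a minimal sketch of a weighted sum of an MSE term, an L1 term, and a perceptual term. It is an illustration under stated assumptions, not the authors' implementation: the weights `w_mse`, `w_l1`, and `w_perc`, the function names, and the `toy_features` extractor are all hypothetical, and a real perceptual loss would typically compare activations of a pre-trained network rather than adjacent differences.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred: Vector, target: Vector) -> float:
    """Mean absolute error between two equal-length vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def toy_features(x: Vector) -> List[float]:
    """Hypothetical stand-in for a feature extractor: adjacent differences."""
    return [b - a for a, b in zip(x, x[1:])]


def hyper_loss(
    pred: Vector,
    target: Vector,
    features: Callable[[Vector], Vector] = toy_features,
    w_mse: float = 1.0,   # assumed weight, not taken from the paper
    w_l1: float = 1.0,    # assumed weight, not taken from the paper
    w_perc: float = 1.0,  # assumed weight, not taken from the paper
) -> float:
    """Weighted sum of MSE, L1, and a feature-space (perceptual-style) term."""
    perceptual = mse(features(pred), features(target))
    return w_mse * mse(pred, target) + w_l1 * l1(pred, target) + w_perc * perceptual


if __name__ == "__main__":
    prediction = [0.0, 1.0, 2.0]
    reference = [0.0, 1.0, 4.0]
    print(hyper_loss(prediction, reference))
```

The three terms penalize different kinds of error: the MSE term weights large deviations heavily, the L1 term is less sensitive to outliers, and the feature-space term compares structure rather than raw values. How the terms are actually weighted in APLA is not stated in this section.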
