SmoothVideo: Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning

Liang Peng, Haoran Cheng, Zheng Yang, Ruisi Zhao, Linxuan Xia, Chaotian Song, Qinglin Lu, Boxi Wu, Wei Liu

TL;DR

This work tackles incoherence in one-shot video tuning by introducing a cross-frame noise constraint loss $L_{noise}$ that regularizes noise predictions across adjacent frames. The loss is added to the original objective as $L = L_{org} + \lambda_2 L_{noise}$, which yields smoother latents and videos. The paper theoretically connects frame-wise latent changes to noise predictions, and proposes a training-time loss plus an inference-time adjustment to improve temporal coherence. To better quantify smoothness, the authors introduce the video latent (VL) score, a latent-based, sliding-window metric that captures adjacent-frame consistency beyond traditional CLIP-based measures. Applied to multiple one-shot tuning baselines and to training-free methods, the approach consistently improves temporal smoothness. The loss is simple to port to existing pipelines, and the VL score offers a more reliable evaluation of video synthesis quality.
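As a rough illustration of how the combined objective $L = L_{org} + \lambda_2 L_{noise}$ could be computed, the sketch below uses plain Python lists in place of tensors. The exact form of $L_{noise}$ is not given in this summary, so the version here (matching the frame-to-frame change of the predicted noise to that of the sampled noise) is an assumption. The function names and the default weight `lam=0.1` are likewise hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(pred, target, lam=0.1):
    """Return (L, L_org, L_noise) for per-frame noise vectors.

    pred, target: lists of flat float lists, one per video frame
                  (predicted noise and sampled noise, respectively).
    lam:          weight on the cross-frame term (hypothetical default).
    """
    n_frames = len(pred)
    if n_frames < 2:
        raise ValueError("need at least two frames for a cross-frame term")

    # L_org: the usual per-frame denoising loss, averaged over frames.
    l_org = sum(mse(p, t) for p, t in zip(pred, target)) / n_frames

    # L_noise (assumed form): the change in predicted noise between
    # adjacent frames should match the change in the sampled noise.
    l_noise = 0.0
    for n in range(1, n_frames):
        d_pred = [a - b for a, b in zip(pred[n], pred[n - 1])]
        d_tgt = [a - b for a, b in zip(target[n], target[n - 1])]
        l_noise += mse(d_pred, d_tgt)
    l_noise /= n_frames - 1

    return l_org + lam * l_noise, l_org, l_noise
```

Under this assumed form, a prediction that is off by the same constant in every frame is penalized only by $L_{org}$, while a prediction that jumps between adjacent frames is also penalized by $L_{noise}$.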

Abstract

Recent one-shot video tuning methods, which fine-tune the network on a specific video based on pre-trained text-to-image models (e.g., Stable Diffusion), are popular in the community because of their flexibility. However, these methods often produce videos marred by incoherence and inconsistency. To address these limitations, this paper introduces a simple yet effective noise constraint across video frames. This constraint aims to regulate noise predictions across temporal neighbors, resulting in smooth latents. It can be included simply as a loss term during the training phase. By applying the loss to existing one-shot video tuning methods, we significantly improve the overall consistency and smoothness of the generated videos. Furthermore, we argue that current video evaluation metrics inadequately capture smoothness. To address this, we introduce a novel metric that considers detailed features and their temporal dynamics. Experimental results validate the effectiveness of our approach in producing smoother videos on various one-shot video tuning baselines. The source code and video demos are available at https://github.com/SPengLiang/SmoothVideo.

Paper Structure

This paper contains 19 sections, 13 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Comparisons to the baseline. By simply employing the proposed noise constraint loss in the training phase, the model produces smoother videos at the inference stage. We highly recommend that readers refer to the supplementary material for better video comparisons.
  • Figure 2: Training overview. We apply the proposed noise constraint loss (smooth loss) in the training process for one-shot video tuning. We follow the same pipeline as Tune-A-Video [wu2023tune], which uses a captioned video to fine-tune a pre-trained text-to-image (T2I) model with a modified network architecture to fit video data.
  • Figure 3: The computation of video latent score (VL score) between video frame $n$ and the previous frame $n-1$. The latents from the previous frame slide with a window of size $h\times w$. We calculate the cosine similarity between each resulting latent and the current frame's latent, then select the maximum value to represent the smoothness. This sliding window design is intended to mitigate scene misalignment caused by motion.
  • Figure 4: Qualitative comparisons to the Tune-A-Video [wu2023tune] baseline. Our method significantly improves video consistency and smoothness. For a more detailed and comprehensive comparison, we strongly recommend that readers refer to the supplementary material, which provides additional video comparisons.
  • Figure 5: Qualitative comparisons to the ControlVideo [zhao2023controlvideo] baseline. More detailed and comprehensive video comparisons are included in the supplementary material.
  • ...and 2 more figures
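
The VL score computation described in the Figure 3 caption can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: latents are represented as single-channel 2D lists, the window is assumed to be odd-sized and centered on zero shift, only the overlapping region is compared for each shift, and the per-pair scores are averaged over the video. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two flat vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def vl_score_pair(prev, curr, h=3, w=3):
    """Max cosine similarity between curr and shifted copies of prev.

    prev, curr: 2D lists (H x W) of floats standing in for frame latents.
    h, w:       sliding-window size (assumed odd); shifts cover
                [-(h // 2), h // 2] x [-(w // 2), w // 2].
    """
    H, W = len(curr), len(curr[0])
    best = -1.0
    for dy in range(-(h // 2), h // 2 + 1):
        for dx in range(-(w // 2), w // 2 + 1):
            a, b = [], []
            for y in range(H):
                for x in range(W):
                    py, px = y + dy, x + dx
                    # Assumption: compare only positions that stay in bounds.
                    if 0 <= py < H and 0 <= px < W:
                        a.append(prev[py][px])
                        b.append(curr[y][x])
            best = max(best, cosine(a, b))
    return best


def vl_score(frames, h=3, w=3):
    """Average the pairwise score over adjacent frames (assumed aggregation)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    pairs = [vl_score_pair(frames[i - 1], frames[i], h, w)
             for i in range(1, len(frames))]
    return sum(pairs) / len(pairs)
```

Taking the maximum over shifts means that a frame whose content has simply moved by a pixel or two still scores close to 1, which is the motion-misalignment tolerance the caption describes.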