LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation

Yuxiang Bao, Di Qiu, Guoliang Kang, Baochang Zhang, Bo Jin, Kaiye Wang, Pengfei Yan

TL;DR

LatentWarp addresses temporal coherence in zero-shot video-to-video translation by constraining query tokens and warping latents with optical flow to align adjacent frames in the diffusion process. The method warps latents from the previous frame, uses binary masks to preserve or replace warped regions, and performs latent alignment during early denoising steps to enforce consistent attention across frames. It avoids extensive video training data by operating in the latent space of a pretrained diffusion model and leveraging ControlNets and RAFT-based flow. Empirical results on DAVIS demonstrate superior temporal consistency and style fidelity compared to state-of-the-art zero-shot video translation methods.
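The warp-and-mask step described above can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `warp_latent` and `align_latent` are hypothetical, the flow is assumed to be already resized to latent resolution and to point from each current-frame location to its source in the previous frame, sampling is nearest-neighbour, and the binary mask here only marks in-bounds source locations (the criterion the paper uses for its mask is not given in this summary).

```python
import numpy as np

def warp_latent(prev_latent, flow):
    """Backward-warp prev_latent (C, H, W) with flow (2, H, W).

    flow[0] and flow[1] hold the x and y displacement, in latent pixels,
    from each current-frame location to its source in the previous frame
    (an assumed convention). Returns the warped latent and a boolean mask
    that is True where the source location lies inside the latent grid.
    """
    _, H, W = prev_latent.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.rint(xs + flow[0]).astype(int)
    src_y = np.rint(ys + flow[1]).astype(int)
    valid = (src_x >= 0) & (src_x < W) & (src_y >= 0) & (src_y < H)
    # Clip so the gather below is always in range; invalid pixels are masked out later.
    src_x = np.clip(src_x, 0, W - 1)
    src_y = np.clip(src_y, 0, H - 1)
    warped = prev_latent[:, src_y, src_x]  # nearest-neighbour gather, shape (C, H, W)
    return warped, valid

def align_latent(cur_latent, warped, mask):
    """Take warped values where mask is True and keep the current latent elsewhere."""
    return np.where(mask[None], warped, cur_latent)
```

In a denoising loop, one would call `warp_latent` on the previous frame's latent at the current step and then `align_latent` to blend it into the current frame's latent; a bilinear sampler could replace the nearest-neighbour gather.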

Abstract

Leveraging the generative ability of image diffusion models offers great potential for zero-shot video-to-video translation. The key lies in how to maintain temporal consistency across the video frames generated by image diffusion models. Previous methods typically adopt cross-frame attention, i.e., sharing the key and value tokens across the attention operations of different frames, to encourage temporal consistency. However, in those works the temporal inconsistency issue may not be thoroughly solved, which limits the fidelity of the generated videos. In this paper, we find that the bottleneck lies in the unconstrained query tokens and propose a new zero-shot video-to-video translation framework, named LatentWarp. Our approach is simple: to constrain the query tokens to be temporally consistent, we incorporate a warping operation in the latent space. Specifically, based on the optical flow obtained from the original video, we warp the generated latent features of the previous frame to align with the current frame during the denoising process. As a result, the corresponding regions across adjacent frames can share closely related query tokens and attention outputs, which further improves latent-level consistency and enhances the visual temporal coherence of the generated videos. Extensive experimental results demonstrate the superiority of LatentWarp in achieving video-to-video translation with temporal coherence.
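To make the point about unconstrained query tokens concrete, the following NumPy sketch shows cross-frame attention in which every frame reads the key and value tokens of one reference frame. It is a simplified illustration, not the paper's code: the function names, the single-head formulation, and the choice of frame 0 as the reference are assumptions. Because the queries remain per-frame, two frames produce identical attention outputs only where their queries also match, which is the gap that latent warping is meant to close.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention; q: (N, d), k and v: (M, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_frame_attention(queries, keys, values, ref=0):
    """Each frame attends to the key/value tokens of the reference frame.

    queries, keys, values: arrays of shape (frames, tokens, dim).
    Only keys and values are shared; each frame keeps its own queries.
    """
    return np.stack([attention(q, keys[ref], values[ref]) for q in queries])
```

With shared keys and values, copying the queries of one frame onto another makes their outputs coincide, while independent queries generally do not.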

Paper Structure (21 sections, 7 equations, 6 figures, 1 table, 1 algorithm)

This paper contains 21 sections, 7 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: LatentWarp, a new zero-shot video-to-video translation framework, possesses the capability of performing high-quality video translation with temporal consistency. By supplying the original video and a target text prompt, LatentWarp empowers users to seamlessly translate diverse videos with the target style, while maintaining the temporal coherence of the original video content.
  • Figure 2: Illustration of the respective effects of cross-frame attention, which constrains the key and value tokens to be the same across frames, and our method, which additionally constrains the query tokens by warping the latents. The baseline refers to independently translating each frame with ControlNet. By examining the pattern of randomly generated stars in the background, we observe that cross-frame attention only constrains the global style or appearance, while the details change across the entire sequence. In contrast, our method effectively maintains temporal consistency, ensuring both global and detailed consistency throughout the video.
  • Figure 3: Overview of our proposed framework LatentWarp. The left part of the figure shows the overall framework of LatentWarp. In each denoising step, we warp the latent between adjacent frames to align them in the latent space. The right part illustrates the technical details of latent warping, binary mask generation, and latent alignment.
  • Figure 4: Results. The translation results of two videos with different prompts. It can be seen that our method achieves high translation quality combined with strong temporal coherence.
  • Figure 5: Ablation study. We ablate the effect of latent alignment through visualization results. The results show that latent alignment maintains the fine-grained visual details effectively.
  • ...and 1 more figures