Zero-Shot Video Translation via Token Warping

Haiming Zhu, Yangyang Xu, Jun Yu, Shengfeng He

TL;DR

This work addresses the challenge of temporally coherent zero-shot video translation using diffusion models. It introduces TokenWarping, which warps query, key, and value tokens via optical flow with occlusion handling and anchor tokens to enforce long-term consistency, all without training. The approach significantly improves temporal coherence and editing accuracy compared with prior zero-shot and inversion-based methods, while maintaining practical runtimes. The framework integrates with Stable Diffusion and ControlNet, enabling editable video translations guided by text prompts and structure cues, with broad implications for diffusion-based video editing workflows.

Abstract

With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often sacrificing the preservation of local and structural regions. Critically, these methods overlook the significance of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across different frames. Our method begins by extracting optical flows from source videos. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame's query, key, and value patches, aligning them with the current frame's patches. By directly warping the query patches, we enhance feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the self-attention layer outputs, effectively ensuring temporally coherent translation. Our framework does not require any additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. We conduct extensive experiments on various video translation tasks, demonstrating that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively. Video demonstrations can be found on our project webpage: https://alex-zhu1.github.io/TokenWarping/. Code is available at: https://github.com/Alex-Zhu1/TokenWarping.
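The warping step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `warp_tokens`, the backward-flow convention, nearest-neighbour sampling, and the fallback to the current frame's tokens at occluded positions are all assumptions made for illustration.

```python
import numpy as np

def warp_tokens(prev_tokens, cur_tokens, flow, occluded):
    """Warp previous-frame tokens onto the current frame's token grid.

    prev_tokens, cur_tokens: (H, W, C) arrays of query, key, or value tokens.
    flow: (H, W, 2) backward flow in token units; flow[y, x] = (dx, dy) points
          from current-frame position (x, y) to its source in the previous frame.
    occluded: (H, W) boolean mask, True where the flow is unreliable.
    """
    h, w, _ = prev_tokens.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup, clamped to the grid (an assumption of this sketch).
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    warped = prev_tokens[src_y, src_x]
    # Occluded positions keep the current frame's own tokens (assumed fallback).
    return np.where(occluded[..., None], cur_tokens, warped)

# Toy example: a 2x3 token grid whose content moved one token to the right.
prev = np.arange(12, dtype=float).reshape(2, 3, 2)
cur = -np.ones((2, 3, 2))
flow = np.zeros((2, 3, 2))
flow[..., 0] = -1.0  # each current token comes from one step to the left
occ = np.zeros((2, 3), dtype=bool)
occ[0, 0] = True     # pretend the top-left token is occluded
out = warp_tokens(prev, cur, flow, occ)
```

The same routine would be applied separately to the query, key, and value tensors of a self-attention layer. A real pipeline would more likely use bilinear sampling on GPU tensors, with the flow downsampled to each layer's token resolution.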

Paper Structure (19 sections, 13 equations, 12 figures, 3 tables, 1 algorithm)

Figures (12)

  • Figure 1: Visualization of self-attention features. The top two rows show the attention features after PCA; the bottom two rows show the translated frames. Prompt: A white cat in pink background.
  • Figure 2: We propose a novel zero-shot video translation method, TokenWarping. Given the prompt "cartoon style, in the castle", TokenWarping effectively transfers both the cartoon style and the background castle. In contrast, existing methods tend to overfit the source video, failing to edit the background.
  • Figure 3: Pipeline of TokenWarping. Given a source video $\mathcal{V}$, we first predict the optical flow $\mathcal{F}$ and occlusion mask $\mathcal{M}$ using the method from xu2022gmflow. We then feed the sequence condition $\mathcal{C}$ and target prompt $\mathcal{P}^*$ to ControlNet, which controls the outputs of Stable Diffusion. During each denoising step, we warp the query, key, and value tokens in the self-attention layers of the U-Net decoder using optical flow. At each timestep $i$, we sample the anchor area ($1^{\text{st}}$ frame) of the key and value patches and concatenate it with the warped patches along the feature axis. Additionally, we use optical flow to warp the query patches, enhancing local temporal consistency. A detailed illustration of warping and fusion is shown at the bottom right.
  • Figure 4: Qualitative comparisons with zero-shot video methods. TokenWarping aligns with the video structure and target prompt.
  • Figure 5: Qualitative comparisons with flow-attention competitors. Prompt: A sculpture of a woman running.
  • ...and 7 more figures
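
The anchor-token step mentioned in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's code: the function name `anchored_attention` is hypothetical, and the sketch assumes the anchor keys and values are appended as extra tokens that each query can attend to, which may differ from the exact concatenation the caption describes.

```python
import numpy as np

def anchored_attention(q, k, v, k_anchor, v_anchor):
    """Single-head attention over a frame's tokens plus anchor tokens.

    q, k, v: (N, d) tokens of the current frame (assumed already warped).
    k_anchor, v_anchor: (M, d) tokens sampled from the anchor (first) frame.
    Returns the (N, d) attention output and the (N, N + M) attention weights.
    """
    # Assumption: anchors extend the key/value set, giving (N + M, d) matrices.
    keys = np.concatenate([k, k_anchor], axis=0)
    values = np.concatenate([v, v_anchor], axis=0)
    scores = q @ keys.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Toy example: 4 current-frame tokens, 2 anchor tokens, feature size 8.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
k_a, v_a = (rng.standard_normal((2, 8)) for _ in range(2))
out, weights = anchored_attention(q, k, v, k_a, v_a)
```

Because every frame attends to the same first-frame tokens, the anchor acts as a shared reference, which is one plausible reading of how it supports long-term consistency.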