
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy

TL;DR

This paper introduces FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint, so that semantically similar content is transformed more consistently across frames.

Abstract

The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their potential application in video domains. Zero-shot methods seek to extend image diffusion models to videos without necessitating model training. Recent methods mainly focus on incorporating inter-frame correspondence into attention mechanisms. However, the soft constraint imposed on determining where to attend to valid features can sometimes be insufficient, resulting in temporal inconsistency. In this paper, we introduce FRESCO, intra-frame correspondence alongside inter-frame correspondence to establish a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Beyond mere attention guidance, our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video, significantly improving the visual coherence of the resulting translated videos. Extensive experiments demonstrate the effectiveness of our proposed framework in producing high-quality, coherent videos, marking a notable improvement over existing zero-shot methods.

Paper Structure (14 sections, 9 equations, 15 figures, 2 tables, 1 algorithm)


Figures (15)

  • Figure 1: Our framework enables high-quality and coherent video translation based on a pre-trained image diffusion model. Given an input video, our method re-renders it based on a target text prompt while preserving its semantic content and motion. Our zero-shot framework is compatible with various assistive techniques such as ControlNet, SDEdit and LoRA, enabling more flexible and customized translation.
  • Figure 2: Real video to CG video translation. Methods that rely on optical flow alone (yang2023rerender) suffer from (a)(f) inconsistent or (c)(d)(e) missing optical flow guidance and (b) error accumulation. By introducing FreSCo, our method addresses these challenges well.
  • Figure 3: Framework of our zero-shot video translation guided by FRamE Spatial-temporal COrrespondence (FreSCo). A FreSCo-aware optimization is applied to the U-Net features to strengthen their temporal and spatial coherence with the input frames. We integrate FreSCo into self-attention layers, resulting in spatial-guided attention to keep spatial correspondence with the input frames, efficient cross-frame attention and temporal-guided attention to keep rough and fine temporal correspondence with the input frames, respectively.
  • Figure 4: Illustration of the attention mechanisms. The patches marked with red crosses attend to the colored patches and aggregate their features. Compared to previous attention mechanisms, FreSCo-guided attention further considers the intra-frame and inter-frame correspondences of the input. Spatial-guided attention aggregates intra-frame features based on the self-similarity of the input frame (darker indicates higher weights). Efficient cross-frame attention eliminates redundant patches and retains unique patches. Temporal-guided attention aggregates inter-frame features on the same flow.
  • Figure 5: Visual comparison with inversion-free zero-shot video translation methods.
  • ...and 10 more figures
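
The Figure 4 caption describes spatial-guided attention as aggregating intra-frame features according to the self-similarity of the input frame. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes cosine similarity between input-frame patch features, a softmax with a hypothetical `temperature` parameter, and plain Python lists in place of U-Net feature tensors. The function names `cosine` and `spatial_guided_attention` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def spatial_guided_attention(input_feats, feats, temperature=0.1):
    """Aggregate `feats` using weights derived from the input frame's self-similarity.

    input_feats: one feature vector per patch of the INPUT frame (assumed given).
    feats:       one feature vector per patch of the frame being generated.
    temperature: hypothetical softmax temperature; smaller values give sharper weights.
    Returns (aggregated features, attention weights).
    """
    dim = len(feats[0])
    outputs, all_weights = [], []
    for query in input_feats:
        # Similarity of this patch to every patch of the same input frame.
        logits = [cosine(query, key) / temperature for key in input_feats]
        # Numerically stable softmax over the patches.
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the features being aggregated.
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]
        )
        all_weights.append(weights)
    return outputs, all_weights
```

In this sketch, patches that look alike in the input frame receive nearly the same aggregated feature, which is the intuition behind keeping semantically similar content consistent within a frame.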