Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing

Tongtong Su, Chengyu Wang, Jun Huang, Dongming Lu

TL;DR

The paper tackles precise, temporally consistent video appearance editing guided by a reference image. It introduces a two-stage Zero-to-Hero framework: Zero-Stage establishes accurate cross-frame correspondence to guide appearance transfer, while Hero-Stage learns a conditional diffusion model with LoRA-based conditioning to holistically restore the video and balance information from multiple inputs. The approach uses diffusion-feature correspondence (DIFT) within masked cross-image attention to maintain structure and apply the reference appearance, and employs two conditioning modes to prevent leakage and improve generalization across frames. Experiments on a Blender-rendered dataset show improved fidelity and temporal consistency, with a PSNR gain of 2.6 dB over the best-performing baseline and strong results in both qualitative and quantitative comparisons. Overall, Zero-to-Hero offers a robust, memory-efficient method for reference-based video editing that handles large motion and complex appearances more reliably than prior work.

Abstract

Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often lead to ambiguities regarding user intentions and restrict fine-grained control over editing specific aspects of objects. To overcome these limitations, this paper introduces a novel approach named Zero-to-Hero, which focuses on reference-based video editing that disentangles the editing process into two distinct problems. It achieves this by first editing an anchor frame to satisfy user requirements as a reference image and then consistently propagating its appearance across other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially when dealing with objects exhibiting large motions. It offers a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervention in the attention mechanism results in compounded imaging degradation with over-saturated colors and unknown blurring issues. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate the consistency of the appearance, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at https://github.com/Tonniia/Zero2Hero.
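To give a sense of scale for the reported 2.6 dB PSNR improvement, the following minimal sketch uses the standard PSNR definition, 10·log10(MAX² / MSE). The function names and the pixel range `max_value=1.0` are assumptions made for illustration; the abstract does not state the paper's exact evaluation protocol (pixel range, per-frame averaging).

```python
import math


def psnr(mse, max_value=1.0):
    """Standard PSNR in dB for a given mean squared error.

    max_value is the assumed peak pixel value (1.0 for images scaled to [0, 1]).
    """
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)


# Hypothetical baseline error, chosen only to illustrate the arithmetic.
baseline_mse = 0.01
improved_mse = baseline_mse * mse_ratio_for_gain(2.6)

print(round(psnr(baseline_mse), 2))
print(round(psnr(improved_mse), 2))
print(round(mse_ratio_for_gain(2.6), 3))
```

Under this definition, a 2.6 dB gain corresponds to the mean squared error falling to roughly 55% of the baseline's value, independent of the baseline error chosen.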

Paper Structure

This paper contains 20 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Left: Our reference-based editing method enables users to precisely edit appearances by incorporating complex layouts of color with arbitrary tools such as Photoshop or ComfyUI to create references, then consistently propagate these edits to subsequent frames. Right: Our approach supports all spatially-aligned appearance editing, including texture and style.
  • Figure 2: Our framework. Zero-Stage: Correspondences ($Corr$) estimated from the anchor and target frames are utilized to guide Cross-image Attention ($Attn$) between the reference and anchor frames, enabling accurate appearance transfer in a zero-shot manner. Hero-Stage: We learn a conditional generative model by incorporating LoRA to process conditional tokens. There are two modes of condition injection: one condition with one LoRA (Mode 1) and two conditions with two independent LoRAs (Mode 2). Four pairs of images serve as potential training data, from a to d (see Table \ref{tab:data_pairs}).
  • Figure 3: Left: $Corr$ guidance with increasing $k$. When $k=h\times w$, it corresponds to the original Cross-image Attention. Right: Using other references results in similar missing patterns (red and green boxes).
  • Figure 4: Mode 1 (a+c) preserves the target structure of the car better than using a alone, but it struggles with style transfer (e.g., watercolor in the second row) and with restoring severely missing background regions. Mode 2 (b) solves both problems. Using d results in appearance leakage from the target frame.
  • Figure 5: Blender-Color-Edit dataset, rendered in Blender.
  • ...and 3 more figures
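
The Figure 3 caption notes that $Corr$ guidance with $k=h\times w$ reduces to the original Cross-image Attention. A minimal sketch of that idea follows, under the assumption that guidance restricts each query to the $k$ keys with the highest correspondence score before the softmax. The function name `topk_masked_cross_attention`, the toy inputs, and the top-$k$ masking rule itself are illustrative assumptions, not the authors' implementation, which is not given in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_masked_cross_attention(queries, keys, values, corr, k):
    """Cross-image attention restricted to the top-k corresponding keys.

    queries: n_q x d features of the frame being edited
    keys, values: n_k x d and n_k x d_v features of the reference
    corr: n_q x n_k correspondence similarities (assumed precomputed,
          e.g. from diffusion features)
    k: number of keys each query may attend to; k == n_k disables masking
    """
    dim = len(queries[0])
    value_dim = len(values[0])
    outputs = []
    for query, corr_row in zip(queries, corr):
        # Keep only the k key positions ranked highest by correspondence.
        ranked = sorted(range(len(keys)), key=lambda j: corr_row[j], reverse=True)
        kept = ranked[:k]
        # Scaled dot-product scores over the kept keys only.
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in kept
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[j][c] for w, j in zip(weights, kept))
            for c in range(value_dim)
        ])
    return outputs


# Toy example: 2 query positions, 3 reference positions.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
corr = [[0.1, 0.2, 0.9], [0.8, 0.1, 0.3]]

print(topk_masked_cross_attention(queries, keys, values, corr, k=1))
print(topk_masked_cross_attention(queries, keys, values, corr, k=3))
```

With `k=1` each query copies the value at its single best-matching reference position, while `k=3` (all keys) ignores `corr` and behaves as unmasked attention. Intermediate values of `k` trade off between strict correspondence and the softer blending of plain attention.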