Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing
Tongtong Su, Chengyu Wang, Jun Huang, Dongming Lu
TL;DR
The paper tackles precise, temporally consistent video appearance editing guided by a reference image. It introduces a two-stage framework, Zero-to-Hero. Zero-Stage establishes accurate cross-frame correspondence to guide appearance transfer; Hero-Stage then learns a conditional diffusion model with LoRA-based conditioning to holistically restore the video and balance information from multiple inputs. The approach uses diffusion-feature correspondence (DIFT) within masked cross-image attention to preserve structure while applying the reference appearance, and employs two conditioning modes to prevent leakage and improve generalization across frames. Experiments on a Blender-generated dataset show improved fidelity and temporal consistency, with a PSNR gain of 2.6 dB over the best-performing baseline. Overall, Zero-to-Hero offers a robust method for reference-based video editing that handles large motion and complex appearances more reliably than prior work.
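To make the idea of correspondence-guided masked cross-image attention concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the top-k masking rule, and the use of plain cosine similarity are all assumptions made for illustration. Correspondence features (for example DIFT features) are assumed to be given as one row vector per spatial position.

```python
import numpy as np

def correspondence_mask(feat_tgt, feat_ref, top_k=4):
    """Boolean (N, M) mask: for each target position, keep the top_k most
    similar reference positions. The top-k rule is an assumption of this sketch."""
    a = feat_tgt / (np.linalg.norm(feat_tgt, axis=1, keepdims=True) + 1e-8)
    b = feat_ref / (np.linalg.norm(feat_ref, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                                  # cosine similarity, shape (N, M)
    k = max(1, min(top_k, sim.shape[1]))
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    mask = np.zeros(sim.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def masked_cross_attention(q, k, v, mask):
    """Scaled dot-product attention from target queries to reference keys and
    values, restricted to the positions allowed by `mask`."""
    scores = (q @ k.T) / np.sqrt(q.shape[1])
    scores = np.where(mask, scores, -np.inf)       # block non-corresponding positions
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

# Toy usage: the reference is a permutation of the target positions.
feat_tgt = np.eye(3)
feat_ref = feat_tgt[[2, 0, 1]]
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k_ref = rng.normal(size=(3, 8))
v_ref = rng.normal(size=(3, 5))
mask = correspondence_mask(feat_tgt, feat_ref, top_k=1)
out = masked_cross_attention(q, k_ref, v_ref, mask)
```

With `top_k=1`, each target position copies the value of its single best-matching reference position, so the mask alone decides where appearance is taken from. Larger `top_k` values let the attention weights blend several candidate matches.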
Abstract
Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often leave user intentions ambiguous and restrict fine-grained control over specific aspects of objects. To overcome these limitations, this paper introduces Zero-to-Hero, a reference-based video editing approach that decomposes the editing process into two distinct problems: first editing an anchor frame to satisfy user requirements, producing a reference image, and then consistently propagating its appearance across the other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially for objects exhibiting large motions. This provides a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervening in the attention mechanism results in compounded image degradation, with over-saturated colors and unknown blurring. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate appearance consistency, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at https://github.com/Tonniia/Zero2Hero.
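For readers unfamiliar with the metric, the following is a minimal sketch of how PSNR between a ground-truth video and an edited video could be computed. The function names, the 8-bit peak value of 255, and the per-frame averaging are assumptions of this sketch; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")                        # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def video_psnr(ref_frames, est_frames, max_value=255.0):
    """Mean per-frame PSNR; averaging over frames is an assumption here."""
    values = [psnr(r, e, max_value) for r, e in zip(ref_frames, est_frames)]
    return float(np.mean(values))

# Toy usage: every pixel is off by exactly one intensity level.
gt = [np.zeros((4, 4, 3)) for _ in range(2)]
edited = [np.ones((4, 4, 3)) for _ in range(2)]
score = video_psnr(gt, edited)
```

Because PSNR is logarithmic in the mean squared error, a gain of 2.6 dB on a single frame pair corresponds to the MSE shrinking by a factor of 10^0.26, roughly 1.82.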
