NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing

Ting-Hsuan Chen, Jiewen Chan, Hau-Shiang Shiu, Shih-Han Yen, Chang-Han Yeh, Yu-Lun Liu

TL;DR

NaRCan is a video editing framework that integrates a hybrid deformation field and a diffusion prior to generate high-quality, natural canonical images that represent the input video. It employs multi-layer perceptrons (MLPs) to capture local residual deformations, enhancing the model's ability to handle complex video dynamics.

Abstract

We propose a video editing framework, NaRCan, which integrates a hybrid deformation field and diffusion prior to generate high-quality natural canonical images to represent the input video. Our approach utilizes homography to model global motion and employs multi-layer perceptrons (MLPs) to capture local residual deformations, enhancing the model's ability to handle complex video dynamics. By introducing a diffusion prior from the early stages of training, our model ensures that the generated images retain a high-quality natural appearance, making the produced canonical images suitable for various downstream tasks in video editing, a capability not achieved by current canonical-based methods. Furthermore, we incorporate low-rank adaptation (LoRA) fine-tuning and introduce a noise and diffusion prior update scheduling technique that accelerates the training process by 14 times. Extensive experimental results show that our method outperforms existing approaches in various video editing tasks and produces coherent and high-quality edited video sequences. See our project page for video results at https://koi953215.github.io/NaRCan_page/.
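
The hybrid deformation field described above (a homography for global motion plus an MLP for local residual deformation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tiny one-hidden-layer network, its input layout `(x, y, t)`, and the choice to add the residual to the homography-warped coordinate are all assumptions made for illustration.

```python
import math


def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography H given as nested lists."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w  # perspective divide


def residual_mlp(x, y, t, params):
    """Hypothetical stand-in for the residual deformation MLP.

    params = (W1, b1, W2, b2): one tanh hidden layer, two outputs (dx, dy).
    """
    W1, b1, W2, b2 = params
    inp = (x, y, t)
    hidden = [
        math.tanh(sum(w * v for w, v in zip(row, inp)) + b)
        for row, b in zip(W1, b1)
    ]
    dx = sum(w * h for w, h in zip(W2[0], hidden)) + b2[0]
    dy = sum(w * h for w, h in zip(W2[1], hidden)) + b2[1]
    return dx, dy


def hybrid_deform(H_t, x, y, t, params):
    """Global homography warp plus a local residual offset (assumed composition)."""
    gx, gy = apply_homography(H_t, x, y)
    dx, dy = residual_mlp(x, y, t, params)
    return gx + dx, gy + dy


# Example: a pure translation homography and an MLP with all-zero weights,
# so the residual is just the output bias (0.5, -0.25).
H = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]]
params = (
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # W1: 2 hidden units x 3 inputs
    [0.0, 0.0],                          # b1
    [[0.0, 0.0], [0.0, 0.0]],            # W2: 2 outputs x 2 hidden units
    [0.5, -0.25],                        # b2
)
canonical_xy = hybrid_deform(H, 3.0, 4.0, 0.5, params)
```

In this toy setup the homography accounts for the bulk of the motion and the MLP only has to model a small correction, which is the division of labor the abstract describes.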

Paper Structure (34 sections, 2 equations, 15 figures, 1 table)

Figures (15)

  • Figure 1: Video representation with diffusion prior. Given an RGB video, we can represent the video using a canonical image. However, training the canonical image through reconstruction alone focuses only on reconstruction quality and can produce an unnatural canonical image. This can cause problems for downstream tasks such as prompt-based video editing. In the bottom example, if the hand is distorted in the canonical image, an image editor such as ControlNet [zhang2023adding] may not recognize it and could introduce an irrelevant object instead. In this paper, we propose introducing the diffusion prior from a LoRA [hu2021lora] fine-tuned diffusion model into the training pipeline and constraining the canonical image to be natural. Our method facilitates several downstream tasks, such as (a) video editing, (b) dynamic segmentation, and (c) video style transfer.
  • Figure 2: Our proposed framework. Given an input video sequence, our method aims to represent the video with a natural canonical image, which is a crucial representation for versatile downstream applications. (a) First, we fine-tune the LoRA weights of a pre-trained latent diffusion model on the input frames. (b) Second, we represent the video using a canonical MLP and a deformation field, which consists of homography estimation and a residual deformation MLP for non-rigid residual deformations. When relying entirely on the reconstruction loss, the canonical MLP often fails to represent a natural canonical image, causing problems for downstream applications. For example, image-to-image translation methods such as ControlNet [zhang2023adding] may not recognize that there is a train in the canonical image. (c) Therefore, we leverage the fine-tuned latent diffusion model to regularize and correct the unnatural canonical image into a natural one. Specifically, we carefully design a noise schedule that corresponds to the frame reconstruction process. (d) The natural, artifact-free canonical image can then be used for various downstream tasks such as video style transfer, dynamic segmentation, and editing, for example adding the handwritten characters "NaRCan".
  • Figure 3: Noise and diffusion prior update scheduling. Initially, before the fields converge and without the diffusion prior, our model fits object outlines, and complex non-rigid objects produce unnatural elements in the canonical image. Once the diffusion prior is introduced with increased noise and update frequency, the model learns to generate natural, high-quality images and converges. The noise strength and the update frequency are then decreased accordingly. This update scheduling cuts training time from 4.8 hours to 20 minutes.
  • Figure 4: Linear interpolation. After using the grid trick [hu2021lora] to obtain the highly consistent canonical images $C_k$ and $C_{k+1}$, we interpolate all frames within the overlap window. As time progresses, the weight for reconstructing each frame gradually shifts from referencing $C_k$ to solely referencing $C_{k+1}$. We achieve editing results with remarkable temporal consistency through this linear interpolation approach. Please refer to our supplementary material for more video results.
  • Figure 5: Qualitative comparisons on text-guided video-to-video translation. Our method achieves the best prompt alignment, synthesis quality, and temporal consistency. Zoom in for the best view, and please refer to the supplementary materials for video comparisons. (a) In the camel scene, MeDM [chu2024medm] fails to generate clear-textured images to ensure temporal consistency, while CCEdit [feng2023ccedit] fails to correctly identify the second camel in the background. (b) CoDeF [ouyang2023codef] misses the person in the bottom right corner, Hashing-nvd [chan2023hashing] exhibits noticeable contours due to masking, and both MeDM and CCEdit suffer from temporal inconsistency. For instance, in MeDM, the person transitions from wearing black clothes to blue clothes. (c) MeDM and CCEdit still exhibit temporal inconsistency, such as significant color, texture, and structure changes. Other methods almost entirely lose the original train information or produce unnatural artifacts.
  • ...and 10 more figures
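
The scheduling idea in the Figure 3 caption (no diffusion prior at first, then strong noise and frequent prior updates that both decrease as training converges) might look roughly like the sketch below. The captions give no concrete values, so the warm-up fraction, noise range, update intervals, and the linear decay are all hypothetical placeholders, and `prior_schedule` is an illustrative name rather than anything from the authors' code.

```python
def prior_schedule(step, total_steps, warmup_frac=0.1,
                   max_noise=0.8, min_noise=0.1,
                   min_interval=10, max_interval=200):
    """Return None during warm-up, else (noise_strength, update_interval).

    All default values are hypothetical; only the overall shape
    (strong and frequent first, weaker and rarer later) follows the caption.
    """
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        return None  # reconstruction loss only, diffusion prior not yet used
    progress = (step - warmup) / max(1, total_steps - warmup)
    # Noise strength decays linearly from max_noise to min_noise.
    noise = max_noise + (min_noise - max_noise) * progress
    # A growing interval between prior updates means a falling update frequency.
    interval = round(min_interval + (max_interval - min_interval) * progress)
    return noise, interval


early = prior_schedule(100, 1000)  # prior just introduced
late = prior_schedule(550, 1000)   # halfway through the decay
```

As a sanity check on the reported numbers: 4.8 hours is 288 minutes, and 288 / 20 = 14.4, which is consistent with the roughly 14-times speedup stated in the abstract.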
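
The linear interpolation in the Figure 4 caption can be sketched as a per-frame blend between the reconstruction obtained from $C_k$ and the one obtained from $C_{k+1}$. The helper names, the flat-list frame representation, and the overlap window bounds below are assumptions for illustration only.

```python
def blend_weight(t, overlap_start, overlap_end):
    """Weight given to C_{k+1} at frame index t.

    0 at the start of the overlap window, 1 at its end, linear in between.
    """
    if overlap_end <= overlap_start:
        raise ValueError("overlap window must have positive length")
    w = (t - overlap_start) / (overlap_end - overlap_start)
    return min(1.0, max(0.0, w))  # clamp outside the window


def blend_frames(frame_from_ck, frame_from_ck1, w):
    """Per-pixel linear blend of two reconstructions of the same frame."""
    return [(1.0 - w) * a + w * b for a, b in zip(frame_from_ck, frame_from_ck1)]


# Example: overlap window covering frames 10..14, two-pixel toy frames.
w_mid = blend_weight(12, 10, 14)
blended = blend_frames([0.0, 1.0], [1.0, 0.0], 0.25)
```

Frames at the start of the window are reconstructed entirely from $C_k$ and frames at the end entirely from $C_{k+1}$, so there is no abrupt switch between canonical images.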