FRAG: Frequency Adapting Group for Diffusion Video Editing

Sunjae Yoon, Gwanhyeong Koo, Geonwoo Kim, Chang D. Yoo

TL;DR

FRAG tackles high-frequency leakage in diffusion-based video editing by introducing a Frequency Adapting Group with an adaptive receptive field. It combines Frequency Adaptive Refinement and Temporal Grouping to preserve high-frequency details during denoising, guided by the spectral characteristics of the denoising process. The approach is plug-and-play and training-free, and it is reported to improve frame consistency, fidelity, and high-frequency preservation across multiple diffusion-based editors on TGVE and DAVIS. This frequency-aware, model-agnostic strategy offers a practical path to steadier, more faithful video edits without retraining diffusion models. The work also discusses limitations and future directions, including scene-aware grouping and faster, more controllable editing.

Abstract

In video editing, the hallmark of a quality edit lies in its consistent and unobtrusive adjustment. A modification, once integrated, must be smooth and subtle, preserving the natural flow and aligning seamlessly with the original vision. Therefore, our primary focus is on overcoming the current challenges in high-quality editing to ensure that each edit enhances the final product without disrupting its intended essence. However, quality deterioration such as blurring and flickering is routinely observed in recent diffusion video editing systems. We confirm that this deterioration often stems from high-frequency leak: the diffusion model fails to accurately synthesize high-frequency components during the denoising process. To this end, we devise the Frequency Adapting Group (FRAG), which enhances video quality in terms of consistency and fidelity by introducing a novel receptive field branch to preserve high-frequency components during the denoising process. FRAG operates in a model-agnostic manner without additional training, and its effectiveness is validated on video editing benchmarks (i.e., TGVE, DAVIS).

Paper Structure

This paper contains 37 sections, 10 equations, 16 figures, 1 table, and 1 algorithm.

Figures (16)

  • Figure 1: Illustration of video quality deterioration, divided into two distinct categories: (a) content blur and (b) content flicker. For comparison, we present our results in (c).
  • Figure 2: (a) Frequency magnitudes of videos in the low- and high-frequency bands. (b) Video quality evaluations of frame consistency and fidelity with respect to the low- and high-frequency components in videos. Normalized frequency $0<f<\pi$, low frequency: $f< 0.25 \pi$, high frequency: $f> 0.25 \pi$.
  • Figure 3: (a) shows experimental observations about latent noise reconstruction in terms of low and high frequencies, where high-frequency components are synthesized later in denoising than low frequencies. (b) illustrates previous video diffusion denoising and (c) illustrates our proposed denoising with the receptive field branch using Frequency Adapting Group (FRAG) to enhance the quality of editing.
  • Figure 4: Illustration of Frequency Adapting Group (FRAG). FRAG takes the latent noise $z_{t}$ at step $t$ and produces a receptive field $g_{t}$, referred to as a temporal group. The temporal group $g_{t}$ guides the denoising UNet to adaptively synthesize the frequency components according to the frequency variations of the latent noise during the denoising process. FRAG contains (a) frequency adaptive refinement, which enhances the visual quality of attributes within the latent noise, and (b) temporal grouping, which clusters latent noise frames to build $g_{t}$.
  • Figure 5: (a) Denoising spectral characteristic: average frequency variation over denoising steps 1000 to 0 on 800 videos from TGVE (Wu et al., 2023) and UCF-101 (Soomro et al., 2012). (b) Frequency variation at step $t$: it is approximated by a normalized distance of moment in the differential frequency distribution.
  • ...and 11 more figures
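
The low/high band split quoted in the Figure 2 caption ($f<0.25\pi$ versus $f>0.25\pi$ on a normalized frequency axis) can be illustrated with a small FFT mask. The following is a minimal sketch and not the authors' implementation. The function name `split_frequency_bands`, the use of a radial frequency to assign each 2D coefficient to a band, and the choice to place the boundary value $f=0.25\pi$ in the high band are all assumptions made for illustration.

```python
import numpy as np


def split_frequency_bands(frame, cutoff=0.25 * np.pi):
    """Split a 2D frame into low- and high-frequency parts (hypothetical helper).

    Frequencies are in radians per sample, so each axis spans [-pi, pi).
    """
    frame = np.asarray(frame, dtype=np.float64)
    height, width = frame.shape

    # fftfreq returns cycles per sample in [-0.5, 0.5); scale to radians.
    fy = 2.0 * np.pi * np.fft.fftfreq(height)
    fx = 2.0 * np.pi * np.fft.fftfreq(width)

    # Assumption: the radial frequency decides the band of each coefficient.
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    low_mask = radius < cutoff  # the boundary itself goes to the high band

    spectrum = np.fft.fft2(frame)
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal((16, 16))
    low, high = split_frequency_bands(frame)
    # The two bands partition the spectrum, so they sum back to the input.
    print(np.allclose(low + high, frame))
```

Because the two masks are complementary, the low and high parts always sum back to the input frame. Comparing the magnitude of each part across frames is one simple way to reproduce the kind of per-band measurement the caption describes.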