Object-Centric Diffusion for Efficient Video Editing

Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian

TL;DR

Object-Centric Diffusion fixes generation artifacts and further reduces latency by allocating more computation to edited foreground regions, which are arguably more important for perceptual quality.

Abstract

Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, in the form of diffusion inversion and/or cross-frame attention. In this paper, we analyze such inefficiencies and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion to fix generation artifacts and further reduce latency by allocating more computation to edited foreground regions, which are arguably more important for perceptual quality. We achieve this through two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient and background regions and spends most of them on the former, and ii) Object-Centric Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x at comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.
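To make the intuition behind Object-Centric Sampling concrete, the following is a rough back-of-the-envelope sketch, not the paper's cost model. It assumes, purely for illustration, that denoising work scales linearly with (number of tokens) x (number of steps), which ignores the super-linear cost of attention. The function name, its parameters, and the example numbers are all hypothetical.

```python
def relative_denoising_cost(fg_fraction, fg_steps, bg_steps, baseline_steps):
    """Denoising work relative to running `baseline_steps` on every token.

    Assumption (not from the paper): cost is linear in tokens x steps.
    """
    if not 0.0 <= fg_fraction <= 1.0:
        raise ValueError("fg_fraction must lie in [0, 1]")
    # Foreground tokens get fg_steps, background tokens get bg_steps.
    work = fg_fraction * fg_steps + (1.0 - fg_fraction) * bg_steps
    return work / baseline_steps


# Illustrative numbers only, not taken from the paper.
ratio = relative_denoising_cost(
    fg_fraction=0.25, fg_steps=20, bg_steps=5, baseline_steps=20
)
print(f"relative cost: {ratio:.4f}")
```

Under this simplified model, the saving grows as the edited object covers less of the frame and as fewer steps are spent on the background.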

Paper Structure (46 sections, 2 equations, 12 figures, 10 tables, 1 algorithm)

Figures (12)

  • Figure 1: OCD speeds up video editing. We show exemplar editing results of FateZero [fatezero] with and without our OCD optimizations. When including our techniques, the editing is $10\times$ faster than the baseline with similar generation quality.
  • Figure 2: Latency analysis of video editing models. At various diffusion steps, latency is dominated by memory access operations. Among pure computations, attention alone is the bottleneck, especially when using dense cross-frame interactions. As attention is responsible for most of the memory overhead, we hypothesize that reducing the number of its tokens has a significant impact on latency.
  • Figure 3: Off-the-shelf accelerations: First, we replace the (b) default sampler with (c) DPM++ [dpm, dpm++], which allows reducing sampling steps from $50 \rightarrow 20$ without heavy degradation. Then, by applying ToMe, memory and computational overhead decrease, yet results degrade significantly (d). We therefore implement (e) pairing of ToMe indices between inversion and generation and (f) per-frame resampling of destination tokens, regaining the quality. We call the resulting model Optimized-FateZero.
  • Figure 4: Object-Centric Token Merging: By artificially down-weighting the similarities of source tokens of foreground objects, we concentrate the tokens that are left unmerged (in blue) in their locations. Destination tokens (in red) are still sampled randomly within a grid, preserving some background information. Merged source tokens (not shown, to avoid clutter) come from the background.
  • Figure 5: Qualitative comparison with state-of-the-art editing methods: OCD yields significantly faster generations than the baseline we build on top of (i.e., FateZero [fatezero]) and other state-of-the-art methods, without sacrificing quality. Tune-A-Video [tuneavideo] is finetuned on each sequence (denoted with *, finetuning time not included in latency).
  • ...and 7 more figures
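
The idea described in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `object_centric_merge_plan`, the `fg_penalty` parameter, and the subtractive form of the down-weighting are hypothetical. Destination tokens are assumed to be already sampled (the random grid sampling is left out), and the actual fusing of features is omitted; the sketch only decides which source tokens would be merged.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def object_centric_merge_plan(tokens, is_foreground, dst_idx, num_merge,
                              fg_penalty=2.0):
    """Decide which source tokens get merged into which destination tokens.

    tokens        : list of feature vectors, one per token
    is_foreground : list of bools, True where a token lies on the edited object
    dst_idx       : indices of destination tokens (assumed already sampled)
    num_merge     : how many source tokens to merge away
    fg_penalty    : hypothetical amount subtracted from foreground similarities;
                    2.0 exceeds the full range of cosine similarity

    Returns (merges, kept): merges maps source index -> destination index,
    kept lists the source indices left unmerged.
    """
    dst_set = set(dst_idx)
    candidates = []
    for s, tok in enumerate(tokens):
        if s in dst_set:
            continue
        # Best-matching destination for this source token.
        best_sim, best_dst = max((cosine(tok, tokens[d]), d) for d in dst_idx)
        # Down-weight foreground sources so they rank last for merging.
        if is_foreground[s]:
            best_sim -= fg_penalty
        candidates.append((best_sim, s, best_dst))

    # Merge the most similar sources first; the rest stay unmerged.
    candidates.sort(key=lambda c: c[0], reverse=True)
    merges = {s: d for _, s, d in candidates[:num_merge]}
    kept = sorted(s for _, s, _ in candidates[num_merge:])
    return merges, kept


# Toy 2-D features: tokens 1 and 4 are foreground, tokens 0 and 2 are destinations.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05], [0.05, 1.0]]
is_foreground = [False, True, False, False, True, False]
dst_idx = [0, 2]

merges, kept = object_centric_merge_plan(tokens, is_foreground, dst_idx, num_merge=2)
print("merged (source -> destination):", merges)
print("left unmerged:", kept)
```

In this toy input, the two background sources (3 and 5) are merged and the foreground sources (1 and 4) stay unmerged. Setting `fg_penalty=0.0` recovers plain similarity-based selection, in which foreground token 4 is merged instead.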