PRIMEdit: Probability Redistribution for Instance-aware Multi-object Video Editing with Benchmark Dataset

Samuel Teodoro, Agus Gunawan, Soo Ye Kim, Jihyong Oh, Munchurl Kim

TL;DR

PRIMEdit tackles the challenge of localized, instance-aware multi-object video editing in a zero-shot setting by introducing two novel components: Instance-centric Probability Redistribution (IPR) and Disentangled Multi-instance Sampling (DMS). IPR provides precise spatial control by redistributing cross-attention probabilities to confine edits within instance masks, while DMS decouples and harmonizes multiple instance edits through series and parallel sampling with latent fusion and re-inversion. To evaluate locality and leakage, the authors introduce the MIVE dataset and the Cross-Instance Accuracy (CIA) score, demonstrating significant improvements in editing faithfulness, temporal consistency, and leakage reduction over state-of-the-art methods. The work also shows robustness across varying instance sizes and numbers, and provides extensive ablations and user studies, highlighting practical applicability and scalability for complex multi-object video edits.
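
To make the idea of redistributing cross-attention probabilities more concrete, here is a minimal sketch for a single query position. It is an illustration only and not the authors' IPR rule: the function name `redistribute`, the `lam` fraction, and the proportional-sharing scheme are all assumptions introduced here. The only property it tries to capture is that attention mass is moved between tokens (the total stays at 1), toward the instance's tokens inside its mask and away from them outside.

```python
def redistribute(probs, target_tokens, inside_mask, lam=0.5):
    """Toy probability redistribution for one query (spatial) position.

    probs: attention probabilities over text tokens, assumed to sum to 1.
    target_tokens: indices of the tokens describing the instance (hypothetical).
    inside_mask: True if this position lies inside the instance mask.
    lam: fraction of mass moved toward the instance tokens (hypothetical knob).
    """
    target = sorted(set(target_tokens))
    target_set = set(target)
    others = [j for j in range(len(probs)) if j not in target_set]

    if inside_mask:
        # Inside the mask: shift part of the mass onto the instance tokens.
        src, dst, frac = others, target, lam
    else:
        # Outside the mask: strip all mass from the instance tokens.
        src, dst, frac = target, others, 1.0

    if not src or not dst:
        return list(probs)  # nothing to move from or to

    moved = frac * sum(probs[j] for j in src)
    dst_total = sum(probs[j] for j in dst)

    out = list(probs)
    for j in src:
        out[j] = probs[j] * (1.0 - frac)
    for j in dst:
        # Share the moved mass in proportion to the existing probabilities.
        share = probs[j] / dst_total if dst_total > 0 else 1.0 / len(dst)
        out[j] = probs[j] + moved * share
    return out


if __name__ == "__main__":
    probs = [0.4, 0.1, 0.1, 0.4]
    print(redistribute(probs, [1, 2], inside_mask=True))
    print(redistribute(probs, [1, 2], inside_mask=False))
```

With the toy input above, a position inside the mask ends up with more mass on tokens 1 and 2, while a position outside the mask ends up with none on them; in both cases the probabilities still sum to 1.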

Abstract

Recent AI-based video editing has enabled users to edit videos through simple text prompts, significantly simplifying the editing process. However, recent zero-shot video editing techniques primarily focus on global or single-object edits, which can lead to unintended changes in other parts of the video. When multiple objects require localized edits, existing methods face challenges, such as unfaithful editing, editing leakage, and lack of suitable evaluation datasets and metrics. To overcome these limitations, we propose Probability Redistribution for Instance-aware Multi-object Video Editing (PRIMEdit). PRIMEdit is a zero-shot framework that introduces two key modules: (i) Instance-centric Probability Redistribution (IPR) to ensure precise localization and faithful editing and (ii) Disentangled Multi-instance Sampling (DMS) to prevent editing leakage. Additionally, we present our new MIVE Dataset for video editing featuring diverse video scenarios, and introduce the Cross-Instance Accuracy (CIA) Score to evaluate editing leakage in multi-instance video editing tasks. Our extensive qualitative, quantitative, and user study evaluations demonstrate that PRIMEdit significantly outperforms recent state-of-the-art methods in terms of editing faithfulness, accuracy, and leakage prevention, setting a new benchmark for multi-instance video editing.

Paper Structure

This paper contains 32 sections, 12 equations, 24 figures, 11 tables.

Figures (24)

  • Figure 1: Given a video, instance masks, and target instance captions, our PRIMEdit framework enables faithful and disentangled edits at the (a) single-instance and (b)-(c) multi-instance levels, and also applies to the more fine-grained (d) partial-instance level, without additional training. Unlike previous methods, PRIMEdit does not rely on global edit captions but leverages individual instance captions. Each object mask is color-coded to match its corresponding edit caption. Zoom in for better visualization.
  • Figure 2: Limitations of previous SOTA methods. (a) ControlVideo [controlvideo2023zhang] relies on a single global caption, and (b) GAV [groundavideo2024jeong] depends on bounding box conditions that can sometimes overlap. Both are susceptible to unfaithful editing (red arrow) and attention leakage (blue arrow).
  • Figure 3: The overall framework of our PRIMEdit, given $M$ instance captions $c_i$ with corresponding instance masks $\boldsymbol{m_i}$ for editing. Our Disentangled Multi-instance Sampling (DMS) consists of series noise sampling (SNS, yellow box), latent fusion (green box) to fuse different instance latents, re-inversion (purple box) to harmonize the latents after fusion, and parallel noise sampling (PNS, blue box). In addition, our Instance-centric Probability Redistribution (IPR) provides better spatial control.
  • Figure 4: A comparative illustration of our IPR versus others (top) and details of our IPR (bottom).
  • Figure 5: Qualitative comparison for three videos (with increasing difficulty from left to right) in our MIVE dataset. (a) shows the color-coded masks overlaid on the input frames to match the corresponding instance captions. (b)-(d) use global target captions for editing. (e) uses global and instance target captions along with bounding boxes (omitted in (a) for better visualization). (f) uses masks along with global and local target captions. Our PRIMEdit in (g) uses instance captions and masks. Unfaithful editing examples are marked with red arrows, and attention leakage with green arrows.
  • ...and 19 more figures
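
The Figure 3 caption mentions a latent fusion step that combines the latents of different instances. As a rough illustration of what mask-based fusion can look like, the sketch below composites per-instance latents onto a background latent. It is an assumption-laden toy, not the paper's procedure: latents are flattened to plain lists, masks are taken to be binary and non-overlapping, the name `fuse_latents` is hypothetical, and the re-inversion step that follows fusion in the caption is not modeled.

```python
def fuse_latents(background, instance_latents, masks):
    """Composite per-instance latents onto a background latent.

    background: flat list of latent values for the whole frame.
    instance_latents: one flat list per instance, same length as background.
    masks: one flat list of 0/1 values per instance (assumed non-overlapping).
    """
    fused = list(background)
    for latent, mask in zip(instance_latents, masks):
        for p, inside in enumerate(mask):
            if inside:
                # Inside this instance's mask, take that instance's latent.
                fused[p] = latent[p]
    return fused


if __name__ == "__main__":
    background = [0.0, 0.0, 0.0, 0.0]
    latents = [[1.0] * 4, [2.0] * 4]
    masks = [[1, 0, 0, 0], [0, 0, 1, 1]]
    print(fuse_latents(background, latents, masks))  # [1.0, 0.0, 2.0, 2.0]
```

Positions not covered by any mask keep the background value, which is one simple way to keep an edit for one instance from spilling into regions that belong to another instance or to the background.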