OmnimatteZero: Fast Training-free Omnimatte with Pre-trained Video Diffusion Models

Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, Rami Ben-Ari

TL;DR

OmnimatteZero introduces a training-free framework that leverages pre-trained video diffusion models to perform object removal, foreground extraction with associated effects, and seamless layer composition in real time. It advances the field by incorporating Temporal Attention Guidance (TAG) and Spatial Attention Guidance (SAG) to maintain temporal coherence and spatial detail during inpainting, and by exploiting self-attention maps to recover object footprints such as shadows and reflections. Foreground layers are obtained via latent arithmetic with pixel-space refinements, enabling fast, flexible recomposition onto new backgrounds. On Movies and Kubric datasets, OmnimatteZero achieves state-of-the-art background reconstruction (highest PSNR and lowest LPIPS) at 0.04 s per frame, outperforming training-based and self-supervised baselines and enabling scalable, real-time video editing without per-video optimization.
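For readers unfamiliar with the numbers quoted above, here is a minimal sketch of how PSNR is conventionally computed and what 0.04 s per frame means as a frame rate. It uses the standard PSNR definition, not the authors' evaluation code; the function name `psnr` and the `max_value` parameter are illustrative assumptions. LPIPS is omitted because it needs a pre-trained perceptual network.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB (standard definition; higher is better).

    `max_value` is the largest possible pixel value, assumed here to be 1.0
    for frames scaled to [0, 1].
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs: PSNR is unbounded
    return 10.0 * np.log10(max_value ** 2 / mse)

# Runtime quoted in the summary, converted to throughput.
seconds_per_frame = 0.04
frames_per_second = 1.0 / seconds_per_frame
```

At 0.04 s per frame the throughput is about 25 frames per second, which is consistent with the real-time claim.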

Abstract

In Omnimatte, one aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. These are accomplished by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. To overcome this, we introduce temporal and spatial attention guidance modules that steer the diffusion process for accurate object removal and temporally consistent background reconstruction. We further show that self-attention maps capture information about the object and its footprints and use them to inpaint the object's effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.
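The latent arithmetic mentioned in the abstract can be illustrated with a toy sketch. It assumes latents are plain arrays and that the object's contribution is additive in latent space, which real VAE latents only approximate. All function names and the latent shape are hypothetical, and random arrays stand in for encoded videos. The few noising-denoising refinement steps applied after composition are not modeled.

```python
import numpy as np

def extract_foreground_latent(z_video, z_background):
    """Latent difference: what the object and its effects add to the scene."""
    return z_video - z_background

def compose_latent(z_new_background, z_foreground):
    """Latent addition: place the extracted layer onto a new background."""
    return z_new_background + z_foreground

def inject_pixels(decoded, original, mask):
    """Pixel injection: inside the user mask, copy pixels from the original video."""
    return np.where(mask, original, decoded)

rng = np.random.default_rng(0)
shape = (4, 8, 8, 16)  # frames, height, width, channels (hypothetical latent shape)

z_background = rng.normal(size=shape)      # stand-in for the clean-background latent
z_object = np.zeros(shape)
z_object[:, 2:5, 2:5, :] = 1.0             # toy "object" occupying a small region
z_video = z_background + z_object          # toy assumption: additive latents

z_foreground = extract_foreground_latent(z_video, z_background)
z_new_background = rng.normal(size=shape)  # stand-in for a different scene
z_composed = compose_latent(z_new_background, z_foreground)
```

In this toy setting the difference recovers the object exactly. With real latents the subtraction introduces distortions, which is why the method corrects the decoded foreground with pixels from the original video inside the mask.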

Paper Structure

This paper contains 26 sections, 9 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Comparison of object removal results using (a) vanilla image inpainting extended to video, (b) Vid2Vid zero-shot inpainting, and (c) our guidance-based approach. Vanilla methods fail to maintain temporal consistency or clean background reconstruction, while our method achieves coherent, artifact-free inpainting across frames.
  • Figure 2: Overview of our Object Removal strategy in OmnimatteZero. (a) We first identify potential background correspondences across frames. (b) Temporal Attention Guidance (TAG): Temporal attention scores between a foreground point and its background correspondences are replaced with the average attention between all background pairs, promoting consistent inpainting across time. (c) Spatial Attention Guidance (SAG): Within a frame, the attention from a foreground point to nearby background points is adjusted to reflect the mean attention among background points themselves, improving inpainting quality when temporal context is unavailable.
  • Figure 3: Self-attention maps from (a) the LTX Video diffusion model and (b) Stable Diffusion (image-based). The spatio-temporal video latent attends to object-associated effects (e.g., shadow, reflection), whereas image models struggle to capture these associations.
  • Figure 4: (a) Foreground Extraction: The target object is extracted by latent code arithmetic, subtracting the background video encoding from the object+background latent (Latent Diff). This initially results in distortions, which are later corrected by replacing pixel values using the original video and a user-provided mask (Latent Diff + Pixel injection). (b) Layer Composition: The extracted object layer is added to a new background latent (Latent Addition). To improve blending, a few steps of noising-denoising are applied, yielding a more natural integration of the object into the new scene (Latent Addition + Refinement). See video examples in the supplemental material.
  • Figure 5: Qualitative Results: Object removal and background reconstruction. The first row shows input video frames with object masks, while the second row presents the reconstructed backgrounds. Our approach effectively removes objects while preserving fine details, reflections, and textures, demonstrating robustness across diverse scenes. Notice the removal of the cat’s reflection in the mirror and water, the shadow of the dog and bicycle (with the rider), and the bending of the trampoline when removing the jumpers. See video examples in the supplemental material.
  • ...and 6 more figures
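
The attention replacement described in the Figure 2 caption can be sketched on a toy attention matrix. This is an illustrative reading of the caption, not the authors' implementation: the function name `guide_attention`, the boolean `targets` matrix, and the choice not to renormalize rows afterwards are all assumptions. For TAG, `targets` would mark a foreground token's background correspondences in other frames; for SAG, its nearby background tokens in the same frame.

```python
import numpy as np

def guide_attention(attn, is_foreground, targets):
    """Overwrite selected foreground-to-background scores with the background mean.

    attn:          (N, N) attention scores, rows are queries, columns are keys.
    is_foreground: (N,) boolean, True for tokens inside the object mask.
    targets:       (N, N) boolean, True where query i should be guided toward key j.
    """
    guided = attn.copy()
    is_background = ~is_foreground
    # Mean attention over all background-to-background pairs.
    background_mean = attn[np.ix_(is_background, is_background)].mean()
    # Restrict edits to foreground queries attending to selected background keys.
    edit = targets & is_foreground[:, None] & is_background[None, :]
    guided[edit] = background_mean
    return guided

rng = np.random.default_rng(0)
attn = rng.random((6, 6))                     # toy scores for 6 tokens
is_foreground = np.array([True, True, False, False, False, False])

targets = np.zeros((6, 6), dtype=bool)
targets[0, [2, 3]] = True                     # hypothetical correspondences of token 0
targets[1, 4] = True                          # hypothetical correspondence of token 1

guided = guide_attention(attn, is_foreground, targets)
```

Only the selected foreground-to-background entries change; background rows and unselected entries are left as they were. Whether the scores are edited before or after the softmax is not stated in the caption, so the sketch does not address normalization.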