Generative Omnimatte: Learning to Decompose Video into Layers

Yao-Chih Lee, Erika Lu, Sarah Rumbley, Michal Geyer, Jia-Bin Huang, Tali Dekel, Forrester Cole

TL;DR

The paper addresses the problem of decomposing casually captured video into semantically meaningful RGBA layers, each containing an object and its associated effects, without assuming a static background or requiring accurate camera pose or depth. It introduces Generative Omnimatte, a two-stage pipeline: a diffusion-based object-effect-removal model (Casper), finetuned from a video inpainting model and conditioned on trimasks, first produces a clean-plate background and single-object videos; a test-time optimization then reconstructs the omnimatte layers from those outputs. By leveraging a learned generative prior, the method completes occluded regions and handles dynamic backgrounds, and the authors report better qualitative and quantitative results than prior omnimatte methods, along with layer-based editing applications. The work relies on a small curated set of real and synthetic training data, acknowledges limitations in multi-object disentanglement and biases inherited from the prior, and outlines directions for improvement and data release.

Abstract

Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.
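To make the notion of a layered decomposition concrete, the sketch below recomposites a frame from a background and a stack of RGBA layers using the standard back-to-front "over" operator. It is only an illustration under stated assumptions: the function name `composite_layers`, the non-premultiplied alpha convention, and the value range [0, 1] are choices made here, not details taken from the paper.

```python
import numpy as np

def composite_layers(background, layers):
    """Back-to-front 'over' compositing (illustrative, not the paper's code).

    background: (H, W, 3) float array in [0, 1].
    layers: list of (H, W, 4) float RGBA arrays, ordered back to front,
            with non-premultiplied alpha (an assumption of this sketch).
    """
    out = np.asarray(background, dtype=np.float64).copy()
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        # Each layer covers what is behind it in proportion to its alpha.
        out = alpha * rgb + (1.0 - alpha) * out
    return out

# Tiny example: a half-transparent red pixel over a black background.
bg = np.zeros((2, 2, 3))
layer = np.zeros((2, 2, 4))
layer[0, 0] = [1.0, 0.0, 0.0, 0.5]
result = composite_layers(bg, [layer])
```

Editing tasks such as removing or reordering an object then amount to changing the `layers` list before compositing.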

Paper Structure

This paper contains 31 sections, 4 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Generative Omnimatte. Our method decomposes a video into a set of RGBA omnimatte layers, where each layer consists of a fully-visible object and its associated effects like shadows and reflections. We improve upon existing work by adding a generative video prior, allowing our method to complete occluded regions (top, middle) and handle dynamic backgrounds (bottom).
  • Figure 2: Limitations of existing Omnimatte methods. Omnimatte methods (Omnimatte, Omnimatte3D, OmnimatteRF, FactorMatte) rely on restrictive motion assumptions, such as a stationary background, resulting in dynamic background elements becoming entangled with foreground object layers. Furthermore, these methods lack a generative and semantic prior for completing occluded pixels and accurately associating effects with their corresponding objects.
  • Figure 3: Limitations of inpainting models for object removal. While video inpainting models (e.g., ProPainter) can complete plausible background pixels in the input mask region, they preserve the removed objects' shadows and reflections outside the mask.
  • Figure 4: Generative omnimatte framework. Given an input video and binary object masks, Stage 1 applies our object-effect-removal model, Casper, under different trimask conditions to generate a clean-plate background $\mathcal{I}_{\mathrm{bg}}$ and a set of single-object (solo) videos $\mathcal{I}_i$. The trimasks specify regions to preserve (white), regions to remove (black), and regions that potentially contain uncertain object effects (gray). In Stage 2, a test-time optimization reconstructs the omnimatte layers $\mathcal{O}_i$ from pairs of $\mathcal{I}_i$ and $\mathcal{I}_{\mathrm{bg}}$.
  • Figure 5: Effect association prior in a pretrained video generation model. We input a video to a pretrained text-to-video generator (Lumiere) using an SDEdit approach and analyze the spatial self-attention weights. By measuring the attention weights between query tokens and key tokens located within target object areas, we observe that the generator can effectively associate effects with target objects. In this specific example, we re-noise the video to $t=0.5$ and visualize the attention in the middle bottleneck of the U-Net at the sampling step $t=0.125$.
  • ...and 10 more figures
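
The trimask conditioning described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, assuming that white, gray, and black are encoded as 1.0, 0.5, and 0.0, that every pixel outside the object masks is marked uncertain (gray), and that a preserved object wins where masks overlap; the helper `build_trimask` and these conventions are hypothetical and not taken from the paper.

```python
import numpy as np

# Assumed encoding of the three trimask values (hypothetical).
PRESERVE, UNCERTAIN, REMOVE = 1.0, 0.5, 0.0

def build_trimask(object_masks, keep=()):
    """Build one trimask frame from binary object masks.

    object_masks: (N, H, W) boolean array, one mask per object.
    keep: indices of objects to preserve; all other objects are removed.
    """
    masks = np.asarray(object_masks, dtype=bool)
    # Start with every pixel marked as possibly containing object effects.
    trimask = np.full(masks.shape[1:], UNCERTAIN, dtype=np.float32)
    for i, mask in enumerate(masks):
        if i not in keep:
            trimask[mask] = REMOVE
    # Preserved objects are written last so they win on overlaps.
    for i in keep:
        trimask[masks[i]] = PRESERVE
    return trimask

# Two objects in a 1x4 frame: object 0 at column 0, object 1 at column 2.
masks = np.zeros((2, 1, 4), dtype=bool)
masks[0, 0, 0] = True
masks[1, 0, 2] = True

clean_plate_trimask = build_trimask(masks)       # remove every object
solo_trimask = build_trimask(masks, keep=(0,))   # keep object 0 only
```

Under these assumptions, the clean-plate condition removes all objects, while each solo condition preserves one object and removes the rest, leaving the model to decide which effects in the gray region belong to the removed objects.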