Using Diffusion Priors for Video Amodal Segmentation

Kaihua Chen, Deva Ramanan, Tarasha Khurana

TL;DR

This work addresses video amodal segmentation with a two-stage pipeline that builds on the temporal and shape priors of Stable Video Diffusion. The first stage predicts amodal masks from sequences of modal masks and pseudo-depth maps; the second stage completes the RGB content of the occluded regions. Training relies on synthetic modal-amodal pairs, and the method reports state-of-the-art results on synthetic and real datasets, with notable zero-shot generalization. It also supports downstream tasks such as 4D reconstruction and scene editing. Because the diffusion model is conditioned on multi-frame modal masks and depth, it handles heavy occlusion and can produce multiple plausible completions, extending video understanding beyond visible regions.

Abstract

Object permanence in humans is a fundamental cue that helps in understanding the persistence of objects, even when they are fully occluded in the scene. Present-day methods in object segmentation do not account for this amodal nature of the world, and only work for segmentation of visible or modal objects. Few amodal methods exist; single-image segmentation methods cannot handle high levels of occlusion, which are better inferred using temporal information, and multi-frame methods have focused solely on segmenting rigid objects. To this end, we propose to tackle video amodal segmentation by formulating it as a conditional generation task, capitalizing on the foundational knowledge in video generative models. Our method is simple; we repurpose these models to condition on a sequence of modal mask frames of an object along with contextual pseudo-depth maps, to learn which object boundaries may be occluded and should therefore be extended to hallucinate the complete extent of the object. This is followed by a content completion stage that inpaints the occluded regions of the object. We benchmark our approach alongside a wide array of state-of-the-art methods on four datasets and show a dramatic improvement of up to 13% for amodal segmentation in an object's occluded region.
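The phrase "amodal segmentation in an object's occluded region" can be made concrete with a small sketch. It assumes, and the text above does not state this, that the occluded region is the amodal mask minus the modal mask and that the score is an intersection-over-union restricted to that region. The function name `occluded_iou` and the toy masks are hypothetical.

```python
def occluded_iou(pred_amodal, gt_amodal, modal):
    """IoU restricted to the occluded region (amodal minus modal).

    All masks are equally sized 2D lists of 0/1. Assumption: the occluded
    region is every amodal pixel that is not in the visible (modal) mask.
    """
    inter = union = 0
    for p_row, g_row, m_row in zip(pred_amodal, gt_amodal, modal):
        for p, g, m in zip(p_row, g_row, m_row):
            p_occ = bool(p) and not m  # predicted as hidden object pixel
            g_occ = bool(g) and not m  # truly hidden object pixel
            inter += p_occ and g_occ
            union += p_occ or g_occ
    # Convention chosen here: nothing occluded and nothing predicted scores 1.0.
    return inter / union if union else 1.0


# Toy example: the right half of a 2x4 object is hidden.
modal = [[1, 1, 0, 0], [1, 1, 0, 0]]
gt_amodal = [[1, 1, 1, 1], [1, 1, 1, 1]]
pred_amodal = [[1, 1, 1, 0], [1, 1, 1, 1]]  # misses one hidden pixel
score = occluded_iou(pred_amodal, gt_amodal, modal)
```

In the toy example the prediction recovers three of the four hidden pixels and adds no false positives, so the restricted IoU is 3/4. Scoring only the hidden pixels keeps a method from looking good simply by copying the visible mask.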

Paper Structure

This paper contains 28 sections, 3 equations, 20 figures, 8 tables.

Figures (20)

  • Figure 1: In this work, we tackle the problem of video amodal segmentation and content completion: given a modal (visible) object sequence in a video, we develop a two-stage method that generates its amodal (visible + invisible) masks and RGB content. We capitalize on the shape and temporal consistency priors baked into video foundation models because of their large-scale pretraining. Finetuning these models enables us to infer complete shapes and RGB details of objects that undergo occlusion. Our method is effectively able to handle severe occlusions and generalizes across diverse object categories, achieving state-of-the-art results on synthetic and real-world datasets. We show one such example of an unseen deformable object category 'laptop' that undergoes a complete occlusion in the highlighted frame.
  • Figure 2: Model pipeline for amodal segmentation and content completion. The first stage of our pipeline generates amodal masks $\{\hat{\mathcal{A}}_t\}$ for an object, given its modal masks $\{\mathcal{M}_t\}$ and pseudo-depth of the scene $\{\mathcal{D}_t\}$ (which is obtained by running a monocular depth estimator on the RGB video sequence $\{\mathcal{I}_t\}$). The predicted amodal masks from the first stage are then sent as input to the second stage, along with the modal RGB content of the occluded object in consideration. The second stage then inpaints the occluded region and outputs the amodal RGB content $\{\hat{\mathcal{C}}_t\}$ for the occluded object. Both stages employ a conditional latent diffusion framework with a 3D UNet backbone from Stable Video Diffusion (SVD). Conditionings are encoded via a VAE encoder into latent space, concatenated, and processed by a 3D UNet with interleaved spatial and temporal blocks. CLIP embeddings of $\{\mathcal{M}_t\}$ and the modal RGB content provide cross-attention cues for the first and second stage respectively. Finally, the VAE decoder translates outputs back to pixel space.
  • Figure 3: Modal-amodal RGB training pair for content completion. The left frame displays the partially occluded modal RGB content, generated by overlaying amodal masks (black regions) onto the amodal object to disrupt its visual integrity. The right frame shows the original, unoccluded amodal RGB object.
  • Figure 4: Comparison across visibility levels on SAIL-VOS. Our method outperforms the second-best image and video amodal segmentation methods across all visibility ranges (we use Top-1 metrics). This highlights the ability of our method to handle heavy occlusions, and understand when an object is not occluded.
  • Figure 5: Temporal consistency comparison with an image amodal segmentation method. We highlight the lack of temporal coherence in a single-frame diffusion based method, pix2gestalt, for both the predicted amodal segmentation mask and the RGB content for the occluded person in the example shown. By leveraging temporal priors, our approach achieves significantly higher temporal consistency across occlusions.
  • ...and 15 more figures
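The modal-amodal training pair described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data pipeline: frames are tiny nested lists, the occluder mask is supplied by the caller, pixels outside the object are assumed to be black already, and `make_modal_pair` is a hypothetical name.

```python
def make_modal_pair(amodal_rgb, amodal_mask, occluder_mask):
    """Build a (modal_rgb, modal_mask) input from an unoccluded object.

    amodal_rgb: 2D list of (r, g, b) tuples for the full object.
    amodal_mask, occluder_mask: 2D lists of 0/1 with the same size.
    Pixels covered by the occluder are blacked out, as in Figure 3.
    """
    black = (0, 0, 0)
    modal_rgb, modal_mask = [], []
    for rgb_row, a_row, o_row in zip(amodal_rgb, amodal_mask, occluder_mask):
        rgb_out, mask_out = [], []
        for rgb, a, o in zip(rgb_row, a_row, o_row):
            visible = bool(a) and not o  # object pixel not hidden by occluder
            rgb_out.append(rgb if visible else black)
            mask_out.append(1 if visible else 0)
        modal_rgb.append(rgb_out)
        modal_mask.append(mask_out)
    return modal_rgb, modal_mask


# Toy example: a one-row red object whose middle pixel gets occluded.
RED = (255, 0, 0)
amodal_rgb = [[RED, RED, RED]]
amodal_mask = [[1, 1, 1]]
occluder_mask = [[0, 1, 0]]
modal_rgb, modal_mask = make_modal_pair(amodal_rgb, amodal_mask, occluder_mask)
```

The returned pair plays the role of the left frame in Figure 3 (the network input), while the untouched `amodal_rgb` and `amodal_mask` serve as the target, which is the right frame.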