Sequential Amodal Segmentation via Cumulative Occlusion Learning

Jiayang Ao, Qiuhong Ke, Krista A. Ehinger

TL;DR

This work addresses the challenge of amodally segmenting multiple occluded objects without relying on object categories, while also inferring their occlusion order. It introduces a diffusion-based sequential amodal segmentation framework that uses cumulative occlusion learning to accumulate context across occlusion layers and to produce multiple plausible amodal masks per object. The approach places no fixed limit on the number of occlusion layers and handles occluded shapes in a class-agnostic way, generating diverse predictions that reflect the uncertainty in hidden regions. Experiments on three robotics-relevant datasets show improvements over diffusion-based and category-specific baselines, indicating stronger occlusion reasoning and better generalization to unseen objects.

Abstract

To fully understand the 3D context of a single image, a visual system must be able to segment both the visible and occluded regions of objects, while discerning their occlusion order. Ideally, the system should be able to handle any object and not be restricted to segmenting a limited set of object classes, especially in robotic applications. Addressing this need, we introduce a diffusion model with cumulative occlusion learning designed for sequential amodal segmentation of objects with uncertain categories. This model iteratively refines the prediction using the cumulative mask strategy during diffusion, effectively capturing the uncertainty of invisible regions and adeptly reproducing the complex distribution of shapes and occlusion orders of occluded objects. It is akin to the human capability for amodal perception, i.e., to decipher the spatial ordering among objects and accurately predict complete contours for occluded objects in densely layered visual scenes. Experimental results across three amodal datasets show that our method outperforms established baselines.

Paper Structure

This paper contains 13 sections, 12 equations, 8 figures, 5 tables, and 2 algorithms.

Figures (8)

  • Figure 1: The cumulative mask and amodal mask predictions for an input image. Our method can generate reliable amodal masks layer by layer and allows multiple objects per layer.
  • Figure 2: Architecture of our model. Our model receives an RGB image as input and predicts multiple plausible amodal masks layer-by-layer, starting with the unoccluded objects and proceeding to deeper occlusion layers. Each layer's mask synthesis receives as input the cumulative occlusion mask from previous layers, thus providing a spatial context for the diffusion process and helping the model better segment the remaining occluded objects.
  • Figure 3: Cumulative guided diffusion. The diffusion process is informed by the input image and the dynamically updated cumulative mask at each depth layer. The diffusion perturbs only the amodal masks; the image and the corresponding cumulative mask are left unaltered, which preserves their contextual and spatial integrity.
  • Figure 4: (a) Our approach considers the diversity of possible amodal masks, especially for occluded regions (indicated by dashed circles). (b) Example of misjudgement of the order of occluded objects in adjacent layers. Layer 3's prediction reflects Layer 4's ground truth and vice versa. This can also be a challenge for human perception.
  • Figure 5: Comparison of predictions on Intra-AFruit (top) and MUVA (bottom) test images by (a) ours, (b) DIS [wolleb2022diffusion], (c) CIMD [rahman2023ambiguous], (d) PLIn [ao2024amodal], and (e) PointRend [kirillov2020pointrend], where (b) and (c) are diffusion-based methods. Dashed circles indicate objects that were not predicted. The other methods fail to segment some objects or produce less plausible amodal masks than ours.
  • ...and 3 more figures
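
The layer-by-layer use of the cumulative mask described in Figures 2 and 3 can be illustrated with a small sketch. It is not the authors' implementation: the function names `accumulate_layers` and `build_model_input` are hypothetical, the masks are toy arrays rather than diffusion outputs, and stacking the image, cumulative mask, and noisy mask as channels is an assumption about how the conditioning might be supplied. The sketch only shows the two ideas stated in the captions: each layer sees the union of the masks from all previous layers, and noise is applied to the amodal mask alone.

```python
import numpy as np


def accumulate_layers(layer_masks):
    """Return the cumulative mask each layer would receive, plus the final union.

    layer_masks: list of boolean (H, W) amodal masks, ordered from the
    unoccluded layer to the deepest occlusion layer (toy stand-ins for
    the masks a model would predict).
    """
    cumulative = np.zeros(layer_masks[0].shape, dtype=bool)  # empty before layer 1
    contexts = []
    for mask in layer_masks:
        contexts.append(cumulative.copy())  # context available to this layer
        cumulative |= mask                  # add this layer's objects to the union
    return contexts, cumulative


def build_model_input(image, cumulative, noisy_mask):
    """Stack image (H, W, 3), cumulative mask, and noisy amodal mask as channels.

    Assumption: channel-wise concatenation. Only noisy_mask carries diffusion
    noise; the image and the cumulative mask are passed through unchanged.
    """
    extra = [cumulative[..., None].astype(image.dtype),
             noisy_mask[..., None].astype(image.dtype)]
    return np.concatenate([image] + extra, axis=-1)


# Toy scene: a front object and a partially occluded object behind it.
front = np.zeros((4, 4), dtype=bool)
front[0:2, 0:2] = True
behind = np.zeros((4, 4), dtype=bool)
behind[1:3, 1:3] = True

contexts, final_cumulative = accumulate_layers([front, behind])
image = np.zeros((4, 4, 3), dtype=np.float32)
noisy_mask = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
model_input = build_model_input(image, contexts[1], noisy_mask)
```

Here `contexts[0]` is empty because nothing has been segmented before the first layer, and `contexts[1]` equals the front object's mask, which is the spatial context the second layer would receive.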