Segment Anything, Even Occluded

Wei-En Tai, Yu-Lin Shih, Cheng Sun, Yu-Chiang Frank Wang, Hwann-Tzong Chen

TL;DR

Amodal segmentation benefits from decoupling detection and mask decoding. SAMEO retools EfficientSAM as a flexible amodal mask decoder that can pair with both modal and amodal detectors, enabling predictions of occluded object extents. To address data scarcity, the authors introduce Amodal-LVIS, a 300K-image synthetic dataset built from LVIS/LVVIS with paired occluded and unoccluded masks and a dual-annotation scheme. Empirical results show state-of-the-art zero-shot performance on COCOA-cls and D2SA, with strong improvements across multiple front-ends and robust generalization to unseen scenarios, highlighting the practical impact of combining foundation-model decoders with curated data for amodal segmentation.

Abstract

Amodal instance segmentation, which aims to detect and segment both visible and invisible parts of objects in images, plays a crucial role in applications including autonomous driving, robotic manipulation, and scene understanding. Existing methods train the front-end detector and the mask decoder jointly, an approach that lacks flexibility and fails to leverage the strengths of pre-existing modal detectors. To address this limitation, we propose SAMEO, a novel framework that adapts the Segment Anything Model (SAM) as a versatile mask decoder capable of interfacing with various front-end detectors to enable mask prediction even for partially occluded objects. Acknowledging the constraints of limited amodal segmentation datasets, we introduce Amodal-LVIS, a large-scale synthetic dataset comprising 300K images derived from the modal LVIS and LVVIS datasets. This dataset significantly expands the training data available for amodal segmentation research. Our experimental results demonstrate that our approach, when trained on the newly extended training data, which includes Amodal-LVIS, achieves remarkable zero-shot performance on both the COCOA-cls and D2SA benchmarks, highlighting its potential for generalization to unseen scenarios.
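The decoupling described in the abstract can be illustrated with a minimal sketch: any box detector produces prompts, and a separate decoder turns each box into a mask. This is not the authors' implementation. The names `amodal_segment`, `toy_detector`, and `toy_decoder`, and the `(x0, y0, x1, y1)` box convention, are hypothetical stand-ins chosen for illustration; a real setup would plug in an actual detector and the SAMEO decoder.

```python
from typing import Callable, List, Tuple

import numpy as np

# Hypothetical box convention for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def amodal_segment(
    image: np.ndarray,
    detect_boxes: Callable[[np.ndarray], List[Box]],
    decode_mask: Callable[[np.ndarray, Box], np.ndarray],
) -> List[np.ndarray]:
    """Run any box detector, then prompt a separate mask decoder with each box."""
    return [decode_mask(image, box) for box in detect_boxes(image)]


def toy_detector(image: np.ndarray) -> List[Box]:
    """Stand-in detector (assumption): returns one box over the top-left quadrant."""
    h, w = image.shape[:2]
    return [(0, 0, w // 2, h // 2)]


def toy_decoder(image: np.ndarray, box: Box) -> np.ndarray:
    """Stand-in decoder (assumption): fills the prompted box with True."""
    x0, y0, x1, y1 = box
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask


image = np.zeros((8, 8, 3), dtype=np.uint8)
masks = amodal_segment(image, toy_detector, toy_decoder)
```

Because the detector and decoder only meet at the box prompt, either one can be swapped without retraining the other, which is the flexibility the abstract argues for.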

Paper Structure

This paper contains 37 sections, 6 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Amodal segmentation examples: The top row shows the original images. The middle row displays EfficientSAM predicted modal masks that only cover the visible parts of objects. The bottom row illustrates amodal masks that reveal the complete object shapes predicted by our method, SAMEO (Segment Anything Model Even under Occlusion).
  • Figure 2: Overview of our amodal segmentation pipeline. Given an input image, existing object detectors first generate either modal boxes (showing visible regions) or amodal boxes (showing complete object extent). Our SAMEO then processes these detections to produce amodal masks that recover the full shape of objects, including occluded parts.
  • Figure 3: Examples of limitations in existing amodal datasets: (a) DYCE and (b) MP3D-amodal show meaningless architectural elements rendered from 3D meshes that dominate the image space, while (c) pix2gestalt contains potentially incomplete amodal masks due to restrictive generation criteria.
  • Figure 4: Amodal-LVIS dataset generation process. From left to right: original image with unoccluded objects, a selected occluder object, and the synthesized image with occlusion. Our dataset includes both the original and the synthesized image for each instance to prevent occlusion bias during training.
  • Figure 5: Qualitative comparison of amodal mask predictions. For each row: SAMEO's amodal prediction (top) with AISFormer box prompts, and AISFormer's prediction (bottom). Our method demonstrates superior mask quality, exhibiting more precise boundary delineation and robust handling of complex occlusion scenarios. Original images used for evaluation are available in supplementary materials.
  • ...and 6 more figures
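
The occlusion synthesis described in the Figure 4 caption can be sketched as a simple copy-paste composite. This is a minimal illustration under stated assumptions, not the Amodal-LVIS generation code: the function name `synthesize_occlusion` is hypothetical, and the occluder is assumed to be already placed on a canvas the same size as the image. Occluder selection and placement are not covered.

```python
import numpy as np


def synthesize_occlusion(image, target_mask, occluder_rgb, occluder_mask):
    """Paste an occluder over an unoccluded object.

    Returns the composite image, the modal (visible) mask, and the amodal
    (full) mask of the target. All arrays share the same height and width
    (an assumption of this sketch).
    """
    composite = image.copy()
    # Occluder pixels overwrite the original image where the occluder is present.
    composite[occluder_mask] = occluder_rgb[occluder_mask]
    # The full extent of the target is unchanged by the paste.
    amodal_mask = target_mask.copy()
    # Only the target pixels not covered by the occluder remain visible.
    modal_mask = target_mask & ~occluder_mask
    return composite, modal_mask, amodal_mask


# Toy data: a 4x4 target inside a 6x6 black image, occluded by a white band.
image = np.zeros((6, 6, 3), dtype=np.uint8)
target_mask = np.zeros((6, 6), dtype=bool)
target_mask[1:5, 1:5] = True
occluder_rgb = np.full((6, 6, 3), 255, dtype=np.uint8)
occluder_mask = np.zeros((6, 6), dtype=bool)
occluder_mask[3:6, :] = True

composite, modal_mask, amodal_mask = synthesize_occlusion(
    image, target_mask, occluder_rgb, occluder_mask
)

# Keep both the original and the synthesized sample for the same instance,
# as the caption describes, so training sees occluded and unoccluded cases.
paired_samples = [
    (image, target_mask, target_mask),      # unoccluded: modal == amodal
    (composite, modal_mask, amodal_mask),   # occluded: modal is a subset of amodal
]
occlusion_rate = 1.0 - modal_mask.sum() / amodal_mask.sum()
```

In this toy case half of the target is covered, so `occlusion_rate` is 0.5. The unoccluded sample carries identical modal and amodal masks, which is what lets a model learn not to extend masks for objects that are fully visible.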