Amodal Instance Segmentation with Diffusion Shape Prior Estimation

Minh Tran, Khoa Vo, Tri Nguyen, Ngan Le

TL;DR

Amodal Instance Segmentation (AIS) remains challenging because occlusion hides part of each object, yet the model must still predict the complete shape. The authors propose AISDiff, which adds a Diffusion Shape Prior Estimation (DiffSP) module: it conditions a pretrained diffusion model on the visible pixels of the RoI, the occluding mask, and the object category text to infer a shape prior. A Shape Prior Amodal Predictor then refines the amodal mask via attention over that shape prior. The method jointly predicts visible masks, occluding masks, and object category, and is trained with a multi-task loss akin to Mask R-CNN. Experiments on KINS, D2SA, and COCOA-cls report state-of-the-art results, and ablations show the value of the diffusion prior, the choice of diffusion timestep, and the category and occlusion cues.
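To make the "multi-task loss akin to Mask R-CNN" concrete, here is a minimal NumPy sketch that sums a category cross-entropy with binary cross-entropy terms for the visible, occluding, and amodal masks. The function names, the dictionary keys, and the equal weights are assumptions made for illustration. The summary does not give the paper's actual loss terms or weights, and the box-regression term that Mask R-CNN also uses is omitted here.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean per-pixel BCE between predicted probabilities and a 0/1 mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def cross_entropy(logits, label):
    """Softmax cross-entropy for one RoI's category logits."""
    shifted = logits - np.max(logits)            # stabilise the exponentials
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])

def multitask_loss(pred, target, weights=None):
    """Weighted sum of the category loss and the three mask losses.

    Equal weights are an assumption, not the paper's setting.
    """
    weights = weights or {"cls": 1.0, "visible": 1.0, "occluding": 1.0, "amodal": 1.0}
    terms = {"cls": cross_entropy(pred["cls_logits"], target["label"])}
    for name in ("visible", "occluding", "amodal"):
        terms[name] = binary_cross_entropy(pred[name], target[name])
    total = sum(weights[k] * terms[k] for k in terms)
    return total, terms
```

Returning the individual terms alongside the total makes it easy to log each head separately during training.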

Abstract

Amodal Instance Segmentation (AIS) presents an intriguing challenge: predicting the segmentation of both the visible and occluded parts of objects within images. Previous methods have often relied on shape prior information gleaned from training data to enhance amodal segmentation. However, these approaches are susceptible to overfitting and disregard object category details. Recent advancements highlight the potential of conditioned diffusion models, pretrained on extensive datasets, to generate images from latent space. Drawing inspiration from this, we propose AISDiff with a Diffusion Shape Prior Estimation (DiffSP) module. AISDiff begins by predicting the visible segmentation mask and object category, alongside occlusion-aware processing through the prediction of occluding masks. These predictions are then fed into our DiffSP module to infer the shape prior of the object. DiffSP uses conditioned diffusion models pretrained on extensive datasets to extract rich visual features for shape prior estimation. Additionally, we introduce the Shape Prior Amodal Predictor, which uses attention-based feature maps from the shape prior to refine amodal segmentation. Experiments across various AIS benchmarks demonstrate the effectiveness of AISDiff.
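The hand-off from the first-stage predictions to DiffSP can be sketched as a small packaging step. It gathers the three conditioning signals named above: the visible pixels of the RoI, the occluding mask, and a text prompt built from the predicted category. The function `build_diffsp_conditions`, the dictionary keys, and the prompt template are hypothetical. The abstract does not specify how the diffusion model consumes these inputs, so this covers only the data preparation.

```python
import numpy as np

def build_diffsp_conditions(roi, visible_mask, occluding_mask, category):
    """Assemble the conditioning signals for a (hypothetical) DiffSP call.

    roi:            (H, W, 3) float image crop
    visible_mask:   (H, W) 0/1 mask of the object's visible pixels
    occluding_mask: (H, W) 0/1 mask of pixels belonging to occluders
    category:       predicted class name, e.g. "car"
    """
    visible_mask = visible_mask.astype(roi.dtype)
    # Zero out everything except the pixels predicted to be the visible object.
    visible_pixels = roi * visible_mask[..., None]
    # Assumed prompt template; the paper's wording may differ.
    prompt = f"a photo of a {category}"
    return {
        "visible_pixels": visible_pixels,
        "occluding_mask": occluding_mask.astype(roi.dtype),
        "prompt": prompt,
    }
```

Keeping the occluding mask separate from the masked pixels lets a downstream module distinguish "background" from "hidden behind another object", which is the occlusion cue the abstract emphasises.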

Paper Structure (23 sections, 1 equation, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Overall architecture of AISDiff. AISDiff predicts the visible segmentation mask and the object category while simultaneously addressing occlusion by predicting the occluding mask. Next, these predictions are integrated into the Diffusion Shape Prior Estimation (DiffSP) module to establish the object's shape prior. This shape prior is then utilized by AISDiff to produce the amodal segmentation.
  • Figure 2: Overall process of Diffusion Shape Prior Estimation (DiffSP).
  • Figure 3: Overall design of Shape Prior Amodal Predictor.
  • Figure 4: Qualitative results of AISDiff. Left to right: Input RoI, Visible masks, Occluding masks, Amodal masks. Best viewed in color.
  • Figure 5: Spatial attention map of the Shape Prior Amodal Predictor on each RoI. Best viewed in color.
  • ...and 1 more figures
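
The spatial attention referred to in Figures 3 and 5 can be illustrated with a generic, CBAM-style stand-in: pool the shape-prior features over channels, squash the result to (0, 1), and use it to reweight the RoI features. This is not the paper's Shape Prior Amodal Predictor, whose design is not reproduced in this summary. The function names are hypothetical, and the plain sum of the pooled maps stands in for what would normally be a learned convolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(shape_prior_feats):
    """Map (C, H, W) shape-prior features to an (H, W) attention map in (0, 1).

    Channel-wise mean and max pooling are summed and squashed; a learned
    convolution would usually replace the plain sum (assumption).
    """
    avg = shape_prior_feats.mean(axis=0)
    mx = shape_prior_feats.max(axis=0)
    return sigmoid(avg + mx)

def refine_with_prior(roi_feats, shape_prior_feats):
    """Reweight (C', H, W) RoI features by the prior-derived attention map."""
    attn = spatial_attention(shape_prior_feats)
    return roi_feats * attn[None, :, :], attn
```

Locations where the prior responds strongly keep most of their RoI feature magnitude, while locations with a weak prior response are damped.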