ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation
Minh Tran, Winston Bounsavy, Khoa Vo, Anh Nguyen, Tri Nguyen, Ngan Le
TL;DR
ShapeFormer tackles Amodal Instance Segmentation (AIS) by decoupling the visible-to-amodal transition and leveraging category-specific shape priors. It comprises three components: the Visible-Occluding Mask Head for precise visible segmentation; the Category-Specific Shape Prior (Cat-SP) Retriever, which outputs a shape prior $M^i_{a\_prior}$ using a category-conditioned vector-quantized autoencoder; and the Shape-Prior Amodal Mask Head, which uses shape-prior masked attention to predict the amodal and occluded masks guided by the retrieved prior. The model is trained end-to-end with multi-task losses, and the retriever is pretrained with occlusion-aware augmentation. On four AIS benchmarks (KINS, COCOA, COCOA-cls, D2SA), ShapeFormer consistently achieves state-of-the-art results, outperforming prior methods in both visible and amodal AP; the accompanying ablations validate the importance of visible-to-amodal decoupling, the Cat-SP Retriever, and the shape-prior masked attention. Code is available at https://github.com/UARK-AICV/ShapeFormer.
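To make the idea of shape-prior masked attention concrete, the following is a minimal sketch of attention whose weights are restricted to the region covered by a prior mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `shape_prior_masked_attention`, the `threshold` parameter, the single-head dot-product form, and the fallback to full attention for an empty prior are all hypothetical choices made here for clarity.

```python
import math


def shape_prior_masked_attention(queries, keys, values, prior_mask, threshold=0.5):
    """Single-head attention restricted by a per-query prior mask (illustrative only).

    queries:    list of Q vectors of length d (one per object query)
    keys:       list of N vectors of length d (one per image location)
    values:     list of N vectors (one per image location)
    prior_mask: Q x N scores in [0, 1]; location n is attended by query q
                only if prior_mask[q][n] >= threshold (assumed rule)
    """
    outputs, all_weights = [], []
    for q, prior_row in zip(queries, prior_mask):
        allowed = [p >= threshold for p in prior_row]
        # Assumption: an empty prior falls back to unrestricted attention,
        # so the softmax below is always well defined.
        if not any(allowed):
            allowed = [True] * len(keys)

        scale = math.sqrt(len(q))
        logits = [
            sum(a * b for a, b in zip(q, k)) / scale if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]

        # Numerically stable softmax; masked locations get weight exactly 0.
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]

        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    prior = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # second query has an empty prior
    out, w = shape_prior_masked_attention(q, k, v, prior)
    print(w[0], out[0])
```

In this sketch the first query can only attend to the single location inside its prior, while the second query, whose prior is empty, attends everywhere. How the actual model thresholds the prior and handles degenerate priors may differ.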
Abstract
Amodal Instance Segmentation (AIS) is a challenging task because it involves predicting both the visible and occluded parts of objects within images. Existing AIS methods rely on a bidirectional approach, encompassing both the transition from amodal features to visible features (amodal-to-visible) and from visible features to amodal features (visible-to-amodal). We observe that using amodal features through the amodal-to-visible transition can confuse the visible features, because amodal features carry extra information about occluded/hidden segments that are not present in the visible region. Consequently, the quality of the visible features used in the subsequent visible-to-amodal transition is compromised. To tackle this issue, we introduce ShapeFormer, a decoupled Transformer-based model with a visible-to-amodal transition. It establishes an explicit relationship between output segmentations and avoids the need for amodal-to-visible transitions. ShapeFormer comprises three key modules: (i) a Visible-Occluding Mask Head for predicting visible segmentation with occlusion awareness, (ii) a Shape-Prior Amodal Mask Head for predicting amodal and occluded masks, and (iii) a Category-Specific Shape Prior Retriever for providing shape prior knowledge. Comprehensive experiments and extensive ablation studies across various AIS benchmarks demonstrate the effectiveness of ShapeFormer. The code is available at: \url{https://github.com/UARK-AICV/ShapeFormer}.
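The three kinds of output mask mentioned above (visible, amodal, occluded) are tied together by the usual AIS convention that the amodal mask covers the whole object and the occluded mask is the part of it that is not visible. The sketch below illustrates that set relationship on a toy binary grid. It assumes this standard convention; the helper names `occluded_mask` and `occlusion_rate` are hypothetical, and ShapeFormer itself predicts these masks with learned heads rather than deriving them by subtraction.

```python
def occluded_mask(amodal, visible):
    """Occluded region = pixels inside the amodal mask that are not visible.

    Both inputs are equally sized 2D lists of 0/1 values.
    """
    return [
        [int(bool(a) and not bool(v)) for a, v in zip(row_a, row_v)]
        for row_a, row_v in zip(amodal, visible)
    ]


def occlusion_rate(amodal, visible):
    """Fraction of the amodal area that is occluded (0.0 for an empty amodal mask)."""
    amodal_area = sum(sum(row) for row in amodal)
    if amodal_area == 0:
        return 0.0
    occluded_area = sum(sum(row) for row in occluded_mask(amodal, visible))
    return occluded_area / amodal_area


if __name__ == "__main__":
    amodal = [[1, 1, 1],
              [1, 1, 1]]
    visible = [[1, 1, 0],
               [1, 0, 0]]
    print(occluded_mask(amodal, visible))   # [[0, 0, 1], [0, 1, 1]]
    print(occlusion_rate(amodal, visible))  # 0.5
```

In the toy example, three of the six amodal pixels are hidden, so half of the object is occluded.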
