ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation

Minh Tran, Winston Bounsavy, Khoa Vo, Anh Nguyen, Tri Nguyen, Ngan Le

TL;DR

ShapeFormer tackles Amodal Instance Segmentation (AIS) by keeping only the visible-to-amodal transition, decoupled from the amodal-to-visible one, and by leveraging category-specific shape priors. It comprises three components: the Visible-Occluding Mask Head for precise visible segmentation; the Category-Specific Shape Prior (Cat-SP) Retriever, which outputs a shape prior $\mathbf{M}^i_{a\_{prior}}$ using a category-conditioned vector-quantized autoencoder; and the Shape-prior Amodal Mask Head, which uses shape-prior masked attention to predict the amodal and occluded masks guided by the retrieved prior. The model is trained end-to-end with multi-task losses, and the retriever is pretrained with occlusion-aware augmentation. On four AIS benchmarks (KINS, COCOA, COCOA-cls, D2SA), ShapeFormer achieves state-of-the-art results, outperforming prior methods in both visible and amodal AP. The accompanying ablations validate the importance of visible-to-amodal decoupling, the Cat-SP Retriever, and the shape-prior masked attention. Code is available at the project URL.
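As a rough illustration of what "shape-prior masked attention" could look like, the sketch below restricts a single query's attention to positions where a binary prior mask is 1, in the style of the masked attention used by Mask2Former. The function name `masked_attention`, the single-query setup, and the fallback to unmasked attention for an empty prior are assumptions made for illustration. The summary above does not specify how ShapeFormer implements this step.

```python
import math

def masked_attention(query, keys, values, prior_mask):
    """Single-query scaled dot-product attention restricted by a binary mask.

    query:      list[float] of length d
    keys:       list of list[float], one row of length d per position
    values:     list of list[float], one row per position
    prior_mask: list[int] of 0/1 flags, one per position (hypothetical shape prior)
    """
    d = len(query)
    # Assumption of this sketch: an empty prior falls back to full attention.
    if not any(prior_mask):
        prior_mask = [1] * len(keys)
    # Positions outside the prior get a logit of -inf, so softmax gives them weight 0.
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) if keep else float("-inf")
        for key, keep in zip(keys, prior_mask)
    ]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    # Weighted sum of the value rows.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# Usage: the middle position lies outside the prior, so its value is ignored.
out = masked_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    values=[[1.0], [10.0], [3.0]],
    prior_mask=[1, 0, 1],
)
```

In this toy call the two unmasked positions have identical keys, so they share the attention weight equally and the masked position contributes nothing.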

Abstract

Amodal Instance Segmentation (AIS) is a challenging task because it involves predicting both the visible and the occluded parts of objects within images. Existing AIS methods rely on a bidirectional approach, encompassing both the transition from amodal features to visible features (amodal-to-visible) and the transition from visible features to amodal features (visible-to-amodal). We observe that using amodal features through the amodal-to-visible transition can confuse the visible features, because the amodal features carry extra information about occluded/hidden segments that are not present in the visible region. This in turn compromises the quality of the visible features during the subsequent visible-to-amodal transition. To tackle this issue, we introduce ShapeFormer, a decoupled Transformer-based model with a visible-to-amodal transition only. It makes the relationship between the output segmentations explicit and avoids the need for amodal-to-visible transitions. ShapeFormer comprises three key modules: (i) a Visible-Occluding Mask Head for predicting visible segmentation with occlusion awareness, (ii) a Shape-Prior Amodal Mask Head for predicting amodal and occluded masks, and (iii) a Category-Specific Shape Prior Retriever for providing shape prior knowledge. Comprehensive experiments and extensive ablation studies across various AIS benchmarks demonstrate the effectiveness of ShapeFormer. The code is available at: https://github.com/UARK-AICV/ShapeFormer
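To make the relationship between the output masks concrete, the following minimal sketch uses the standard AIS convention that an object's amodal mask is the union of its visible and occluded parts. The helper names `occluded_from` and `amodal_from` and the nested-list mask format are hypothetical choices for this illustration, not part of the ShapeFormer code.

```python
def occluded_from(amodal, visible):
    """Occluded region: pixels inside the amodal mask that are not visible."""
    return [
        [int(a and not v) for a, v in zip(row_a, row_v)]
        for row_a, row_v in zip(amodal, visible)
    ]

def amodal_from(visible, occluded):
    """Amodal region: union of the visible and occluded pixels."""
    return [
        [int(v or o) for v, o in zip(row_v, row_o)]
        for row_v, row_o in zip(visible, occluded)
    ]

# Usage: a 2x4 object whose right part is hidden behind an occluder.
visible = [[1, 1, 0, 0],
           [1, 1, 0, 0]]
amodal = [[1, 1, 1, 0],
          [1, 1, 1, 0]]
occluded = occluded_from(amodal, visible)
reconstructed = amodal_from(visible, occluded)
```

Here `occluded` marks only the third column, and recombining it with `visible` recovers `amodal`.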

Paper Structure (21 sections, 8 equations, 10 figures, 8 tables)


Figures (10)

  • Figure 1: Comparison between our ShapeFormer and existing SOTA approaches. (a) AIS setting, which takes an RoI feature as input and returns four masks: occluding, occluded, visible, and amodal. (b) ASN [qi2019amodal]: bidirectional learning at multi-level coding via feature concatenation. (c) VRSP-Net [xiao2021amodal]: bidirectional learning at the mask head via feature concatenation. (d) AISFormer [tran2022aisformer]: bidirectional learning at embeddings via self-attention. (e) Our ShapeFormer omits the amodal-to-visible transition and leverages the precise visible feature and shape prior knowledge to predict the amodal mask.
  • Figure 2: The overview pipeline illustrates the integration of our ShapeFormer as the amodal mask head within an object detection framework. The input image $\mathbf{I}$ goes through a backbone followed by an object detector to predict the regions of interest (RoIs) and extract their corresponding features. These RoI features are then processed by the proposed ShapeFormer (Figure 3) to obtain the desired output AIS masks.
  • Figure 3: The pipeline of our ShapeFormer, consisting of three main components: the Visible-Occluding (Vis-Occ) Mask Head, the Shape-prior Amodal (SPA) Mask Head, and the Category-specific Shape Prior (Cat-SP) Retriever. "Feat" denotes feature.
  • Figure 4: Detailed architecture. (a) The Vis-Occ Transformer Decoder models the relation between the visible mask and the occluding mask. (b) The Amodal Transformer Decoder with shape-prior masked attention models the relation between the amodal mask and the occluded mask.
  • Figure 5: Flowchart of the Cat-SP Retriever. The input is the visible mask $\mathbf{M}_v^i$ and its class label $c^i$. The output is the category-specific shape prior $\mathbf{M}^i_{a\_{prior}} = f_S (\mathbf{M}_v^i, c^i)$.
  • ...and 5 more figures
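The retrieval step described in Figure 5 can be pictured as a category-conditioned nearest-codeword lookup, which is the core operation of vector quantization. The sketch below is a deliberately simplified stand-in for $f_S$: it skips the autoencoder's encoder and decoder and compares flattened masks directly against a small per-category codebook of complete shapes. The function name `retrieve_shape_prior`, the `codebooks` dictionary, and the toy shapes are all hypothetical and only illustrate the idea.

```python
def retrieve_shape_prior(visible_mask, category, codebooks):
    """Return the codebook entry of the given category closest to the visible mask.

    visible_mask: flattened binary mask (list[int])
    category:     class label used to select the codebook
    codebooks:    dict mapping category -> list of flattened full-shape masks
                  (hypothetical; a real VQ autoencoder quantizes latent codes instead)
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    candidates = codebooks[category]
    # Vector quantization: pick the nearest codeword under squared Euclidean distance.
    return min(candidates, key=lambda code: squared_distance(code, visible_mask))

# Usage: a partially visible "car" is matched against the car codebook only.
codebooks = {
    "car": [[1, 1, 1, 1], [0, 0, 1, 1]],
    "person": [[0, 1, 1, 0]],
}
prior = retrieve_shape_prior([1, 1, 0, 0], "car", codebooks)
```

Conditioning on the class label means the lookup never returns a shape from another category, even if that shape happens to be closer to the visible mask.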