AISFormer: Amodal Instance Segmentation with Transformer

Minh Tran; Khoa Vo; Kashu Yamazaki; Arthur Fernandes; Michael Kidd; Ngan Le

TL;DR

This work presents AISFormer, an AIS framework with a Transformer-based mask head that explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's region of interest by treating them as learnable queries.

Abstract

Amodal Instance Segmentation (AIS) aims to segment the region of both the visible and the possibly occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level feature coherence due to their limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present AISFormer, an AIS framework with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's region of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract the ROI and learn both short-range and long-range visual features; (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings with a transformer decoder; (iii) invisible mask embedding: model the coherence between the amodal and visible masks; and (iv) mask predicting: estimate the output masks, including occluder, visible, amodal, and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks, i.e., KINS, D2SA, and COCOA-cls, to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer
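The four modules listed in the abstract can be pictured with a small, dependency-free sketch. It is only an illustration under stated assumptions, not the authors' implementation: the attention here has no learned projections, the invisible embedding is assumed to be an MLP over the concatenated amodal and visible query embeddings, and each mask is assumed to be a sigmoid of the dot product between a query embedding and the ROI tokens. All function and variable names (`aisformer_head_sketch`, `roi_tokens`, `mlp_layers`, and so on) are hypothetical.

```python
import math
import random

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(queries, keys_values):
    # Single-head scaled dot-product attention, no learned projections (simplification).
    d = len(queries[0])
    scores = matmul(queries, [list(c) for c in zip(*keys_values)])
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, keys_values)

def add(a, b):
    # Element-wise residual connection.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mlp(x, layers):
    # Linear layers with ReLU between them; `layers` is a list of weight matrices.
    for i, w in enumerate(layers):
        x = matmul(x, w)
        if i < len(layers) - 1:
            x = [[max(0.0, v) for v in row] for row in x]
    return x

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def aisformer_head_sketch(roi_tokens, queries, mlp_layers):
    # (i) feature encoding: one self-attention block over the flattened ROI tokens.
    feats = add(roi_tokens, attention(roi_tokens, roi_tokens))
    # (ii) mask transformer decoding: self-attention among the three queries,
    #      then cross-attention from the queries to the ROI features.
    q = add(queries, attention(queries, queries))
    q = add(q, attention(q, feats))
    occluder, visible, amodal = q
    # (iii) invisible mask embedding: MLP over [amodal; visible] (assumed input form).
    invisible = mlp([amodal + visible], mlp_layers)[0]
    # (iv) mask predicting: per-token sigmoid of <embedding, token> (assumed form).
    embeddings = {"occluder": occluder, "visible": visible,
                  "amodal": amodal, "invisible": invisible}
    return {name: [sigmoid(sum(e * t for e, t in zip(emb, tok))) for tok in feats]
            for name, emb in embeddings.items()}

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim, hidden, num_tokens = 4, 8, 9              # a 3x3 ROI flattened to 9 tokens
roi_tokens = rand_matrix(num_tokens, dim, rng)
queries = rand_matrix(3, dim, rng)             # occluder, visible, amodal queries
mlp_layers = [rand_matrix(2 * dim, hidden, rng),   # two hidden layers, then output
              rand_matrix(hidden, hidden, rng),
              rand_matrix(hidden, dim, rng)]
masks = aisformer_head_sketch(roi_tokens, queries, mlp_layers)
```

Each entry of `masks` is a list of per-token probabilities that would be reshaped to the ROI grid (3x3 in this toy setup). In a trained model the random matrices would be replaced by learned parameters.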

Paper Structure (14 sections, 5 equations, 9 figures, 3 tables)

Introduction
Related Work
Instance Segmentation (IS)
Amodal Instance Segmentation (AIS)
Query-based Image Segmentation
Methods
Feature Encoding
Mask transformer decoding
Invisible mask embedding
Mask predicting
Experiments
Datasets, Metrics and Implementation Details
Performance Comparison
Ablations

Figures (9)

Figure 1: An explanation of different mask instances in Amodal Instance Segmentation (AIS). Given a region of interest (ROI) extracted by an object detector, AIS aims to extract both visible and invisible mask instances including occluder, visible, amodal, and invisible.
Figure 2: A comparison between Instance Segmentation (IS) and Amodal Instance Segmentation (AIS). Given an image with ROI (a), IS aims to extract the visible mask instance (b) whereas AIS aims to extract both the visible mask and occluded parts (c).
Figure 3: The overall flowchart of our proposed AISFormer. AISFormer consists of four modules: (i) feature encoding: after obtaining the region of interest (ROI) feature from the backbone $\mathcal{B}_\phi$ and the ROIAlign algorithm $\varphi$, CNN-based layers and a transformer encoder are applied to learn both short-range and long-range features of the given ROI; (ii) mask transformer decoding $\mathcal{D}_\beta$: generate the occluder, visible, and amodal mask query embeddings by a transformer decoder; (iii) invisible mask embedding $\mathcal{I}_\theta$: model the coherence between the amodal and visible masks by computing the invisible mask embedding; and (iv) segmentation: estimate output masks including occluder ($\mathbf{M}_o$), visible ($\mathbf{M}_v$), amodal ($\mathbf{M}_a$), and invisible ($\mathbf{M}_i$).
Figure 4: Illustration of the network architecture of AISFormer. (a): the mask transformer encoder $\mathcal{E}_\alpha$ is designed as one block of self-attention, (b): the mask transformer decoder $\mathcal{D}_\beta$ is designed as a combination of one block of self-attention ($\mathcal{A}_{self}$) and one block of cross-attention ($\mathcal{A}_{cross}$), and (c): the invisible embedding $\mathcal{I}_\theta$ is designed as an MLP with two hidden layers.
Figure 5: Attention visualization of query embeddings. (a): Input image with four ROIs. (b), (c), (d), (e): attention feature maps of queries in each ROI. For each ROI, from left-right: ROI, occluder query embedding, visible query embedding, and amodal query embedding.
...and 4 more figures
