AdaptiveClick: Clicks-aware Transformer with Adaptive Focal Loss for Interactive Image Segmentation

Jiacheng Lin; Jiajun Chen; Kailun Yang; Alina Roitberg; Siyu Li; Zhiyong Li; Shutao Li

TL;DR

AdaptiveClick is, to the best of the authors' knowledge, the first transformer-based, mask-adaptive segmentation framework for Interactive Image Segmentation (IIS). Its key ingredient is the Click-Aware Mask-adaptive transformer Decoder (CAMD), which enhances the interaction between click and image features.

Abstract

Interactive Image Segmentation (IIS) has emerged as a promising technique for decreasing annotation time. Substantial progress has been made in pre- and post-processing for IIS, but the critical issue of interaction ambiguity, notably hindering segmentation quality, has been under-researched. To address this, we introduce AdaptiveClick -- a click-aware transformer incorporating an adaptive focal loss that tackles annotation inconsistencies with tools for mask- and pixel-level ambiguity resolution. To the best of our knowledge, AdaptiveClick is the first transformer-based, mask-adaptive segmentation framework for IIS. The key ingredient of our method is the Click-Aware Mask-adaptive transformer Decoder (CAMD), which enhances the interaction between click and image features. Additionally, AdaptiveClick enables pixel-adaptive differentiation of hard and easy samples in the decision space, independent of their varying distributions. This is primarily achieved by optimizing a generalized Adaptive Focal Loss (AFL) with a theoretical guarantee, where two adaptive coefficients control the ratio of gradient values for hard and easy pixels. Our analysis reveals that the commonly used Focal and BCE losses can be considered special cases of the proposed AFL. With a plain ViT backbone, extensive experimental results on nine datasets demonstrate the superiority of AdaptiveClick compared to state-of-the-art methods. The source code is publicly available at https://github.com/lab206/AdaptiveClick.
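The abstract states that the Focal and BCE losses are special cases of AFL. The AFL formula itself is not given on this page, so the sketch below does not implement it. It only illustrates the simpler, well-known nesting that the claim extends: the standard focal loss reduces to BCE when its focusing parameter is zero, and a larger value shrinks the contribution of easy pixels relative to hard ones. The function names `bce` and `focal` and the parameter name `gamma` are illustrative choices, not taken from the AdaptiveClick code base.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one pixel.

    p: predicted foreground probability in (0, 1); y: label in {0, 1}.
    """
    p_t = p if y == 1 else 1.0 - p  # confidence assigned to the true class
    return -math.log(p_t)

def focal(p, y, gamma=2.0):
    """Standard focal loss (without class weighting) for one pixel.

    The modulating factor (1 - p_t) ** gamma down-weights easy pixels.
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# gamma = 0 removes the modulating factor, so focal loss equals BCE.
easy, hard = 0.9, 0.1  # foreground pixels with high and low confidence
same_as_bce = math.isclose(focal(easy, 1, gamma=0.0), bce(easy, 1))

# With gamma = 2, the hard pixel keeps far more of its BCE loss than the easy one.
easy_weight = focal(easy, 1) / bce(easy, 1)  # (1 - 0.9) ** 2
hard_weight = focal(hard, 1) / bce(hard, 1)  # (1 - 0.1) ** 2
```

According to the abstract, AFL replaces this fixed modulation with two adaptive coefficients that control the ratio of gradient values for hard and easy pixels; their definitions are in the paper's Adaptive Focal Loss section.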

Paper Structure (29 sections, 22 equations, 9 figures, 15 tables)

Introduction
Related Work
Architecture of Interactive Image Segmentation
Loss Function of Interactive Image Segmentation
Method
Deficiency of Existing IIS Methods
Inter-class Click Ambiguity Resolution
Pixel-level Multi-scale Mask Transformer Decoder
Click-Aware Mask-adaptive Transformer Decoder
Mask-adaptive Matching Strategy
Intra-class Click Ambiguity Optimization
Adaptive Difficulty Adjustment
Adaptive Gradient Representation
Adaptive Focal Loss
Model Optimization
...and 14 more sections

Figures (9)

Figure 1: Illustration of the proposed method compared with existing mask-fixed IIS methods. In mask-fixed IIS methods, only a single mask is generated for a given input. In contrast, our mask-adaptive AdaptiveClick can produce multiple candidate masks ($\mathbf{P}_\mathrm{1}\!\!\sim\!\!\mathbf{P}_\mathrm{n}$) to address possible ambiguities introduced by user clicks. The model then selects the optimal combination between the Ground Truth (GT) and Probability Map (PM). Finally, Adaptive Focal Loss (AFL) adaptively adjusts the optimal combination to produce a higher-quality mask. Here, $\mathbf{P}_{\mathrm{t}}$ denotes the confidence of the pixel in the sample; darker colors indicate pixels that are harder to segment, and lighter colors indicate easier ones.
Figure 2: An illustration of AdaptiveClick. First, features of the clicks (together with the previous mask) and of the image ($\mathbf{I}$) are obtained via patch embedding and then fused by addition. Second, the fused features are processed by the encoder, and the pixel features $\mathbf{F}_0$, $\mathbf{F}_1$, and $\mathbf{F}_2$ of different dimensions are obtained by the Pixel-level Multi-scale Mask transformer Decoder (PMMD). Third, $\mathbf{F}^{'}_0, \mathbf{F}^{'}_1, \mathbf{F}^{'}_2$, and $\mathbf{F}_3$ are input into the Click-Aware Mask-adaptive transformer Decoder (CAMD) together with the clicks. Fourth, CAMD generates $n$ corresponding masks $\hat{p}$ based on each click, and training is then optimized with the Adaptive Difficulty Adjustment (ADA) and Adaptive Gradient Representation (AGR) components of the Adaptive Focal Loss (AFL). Finally, the obtained mask sequences are passed through post-processing to output the final mask.
Figure 3: Difficulty confidence visualization of different loss functions on the SBD majumder2019SBD training dataset. From left to right are the image, the ground truth, and the $\mathbf{P}_{\mathrm{t}}$ plot of BCE yi2004automated, FL lin2017focal, PL leng2022polyloss, and AFL, respectively.
Figure 4: Visualization of the gradient correction applied to FL lin2017focal by ADA and AGR.
Figure 5: Segmentation results on natural and medical datasets. The backbone is ViT-B trained on the SBD dataset majumder2019SBD. The probability maps are shown in blue; the masks are overlaid in red on the original images. The clicks are shown as green (positive click) or red (negative click) dots on the image.
...and 4 more figures
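The Figure 1 caption describes producing several candidate masks and selecting the one that best fits the ground truth. A minimal sketch of such a selection step follows, under loud assumptions: the paper's actual Mask-adaptive Matching Strategy is not described on this page, so a plain soft-Dice cost is swapped in here as a stand-in, masks are flattened to lists of floats, and the names `dice_cost` and `select_candidate` are hypothetical.

```python
def dice_cost(prob, gt, eps=1e-6):
    """Return 1 - soft Dice between a flat probability map and a flat binary mask.

    Assumption: a Dice-based cost is used purely for illustration.
    """
    inter = sum(p * g for p, g in zip(prob, gt))
    total = sum(prob) + sum(gt)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def select_candidate(candidates, gt):
    """Return the index of the lowest-cost candidate and the list of all costs."""
    costs = [dice_cost(c, gt) for c in candidates]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs

# Toy 4-pixel example: the first candidate agrees with the ground truth most.
gt = [1, 1, 0, 0]
candidates = [
    [0.9, 0.8, 0.1, 0.2],  # close to the ground truth
    [0.1, 0.2, 0.9, 0.8],  # roughly inverted
    [0.5, 0.5, 0.5, 0.5],  # uninformative
]
best_index, costs = select_candidate(candidates, gt)
```

In this toy setting the selected candidate would then be the one passed to the loss during training; how the paper combines matching with AFL is covered in its Method section.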
