Structured Click Control in Transformer-based Interactive Segmentation

Long Xu, Yongquan Chen, Rui Huang, Feng Wu, Shiwu Lai

TL;DR

The paper tackles the lack of robust click-controlled outputs in Transformer-based interactive segmentation by introducing a structured click control framework that learns a graph of interactive tokens from user clicks. It uses a GNN to fuse selected node features via either a GCN or a GAT, and then integrates these structured features into Vision Transformer representations through a dual cross-attention mechanism. The approach, evaluated on multiple baselines and datasets including remote sensing, shows consistent improvements in click efficiency and segmentation robustness, with GAT-based fusion typically outperforming GCN. The work provides a general, transferable module for enhancing transformer-based interactive segmentation and releases code for public use.

Abstract

Click-point-based interactive segmentation has received widespread attention due to its efficiency. However, it's hard for existing algorithms to obtain precise and robust responses after multiple clicks. In this case, the segmentation results tend to have little change or are even worse than before. To improve the robustness of the response, we propose a structured click intent model based on graph neural networks, which adaptively obtains graph nodes via the global similarity of user-clicked Transformer tokens. Then the graph nodes will be aggregated to obtain structured interaction features. Finally, the dual cross-attention will be used to inject structured interaction features into vision Transformer features, thereby enhancing the control of clicks over segmentation results. Extensive experiments demonstrated the proposed algorithm can serve as a general structure in improving Transformer-based interactive segmentation performance. The code and data will be released at https://github.com/hahamyt/scc.
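
The aggregation and injection steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `aggregate_nodes` and `inject` are hypothetical, the aggregation is a plain neighbour mean (a simplified stand-in for a GCN layer, with no learned weights), and the injection is a single projection-free cross-attention with a residual connection rather than the paper's dual cross-attention, whose exact form is not given in this summary.

```python
import math


def aggregate_nodes(node_feats, adjacency):
    """Fuse each node with its neighbours by averaging (self-loop included).

    Simplified stand-in for a GCN layer: no learned weights, no nonlinearity.
    """
    n = len(node_feats)
    dim = len(node_feats[0])
    fused = []
    for i in range(n):
        nbrs = [j for j in range(n) if j == i or adjacency[i][j]]
        fused.append(
            [sum(node_feats[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        )
    return fused


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def inject(image_tokens, node_feats):
    """Cross-attention: image tokens are queries, node features are keys/values.

    Assumption: no learned projections; the attended context is added back
    to each image token as a residual.
    """
    dim = len(node_feats[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in image_tokens:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in node_feats]
        weights = softmax(logits)
        context = [
            sum(w * k[d] for w, k in zip(weights, node_feats)) for d in range(dim)
        ]
        out.append([a + c for a, c in zip(q, context)])
    return out


# Two graph nodes connected to each other, one image token.
nodes = aggregate_nodes([[1.0, 0.0], [0.0, 1.0]], [[0, 1], [1, 0]])
updated = inject([[2.0, 0.0]], nodes)
```

In this toy case both nodes fuse to the same feature, so the image token receives that feature as its residual update. A real module would use learned query, key, and value projections and multiple heads.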

Paper Structure (19 sections, 12 equations, 6 figures, 6 tables)

This paper contains 19 sections, 12 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Performance of various algorithms after multiple clicks in complex scenarios. Red dots denote negative clicks and green dots denote positive clicks. (a) shows the class of click-point modeling algorithms typified by FocalClick (Chen et al., 2022), and (b) shows the class typified by SAM. The results indicate that both types of algorithms produce uncontrollable network outputs on conventional image data as well as on remote sensing data.
  • Figure 2: Overall structure of the proposed algorithm. (1) The input consists of the input image, the previous mask, and the click maps. The patch embedding module encodes them as tokens, which are then fed to the ViT backbone. (2) The most commonly used ViTs fall into two types, 'Single-Scale' and 'Multi-Scale', both of which can easily adopt the proposed algorithm.
  • Figure 3: Overall structure of the GNN Block. "Top-k selection" refers to selecting tokens with scores greater than 0.95, with at most 32 tokens selected. When few tokens are similar to the clicked token, the number of selected tokens is significantly smaller than 32. Any graph neural network can be used in the GNN Block for node feature fusion. In this paper, we select GCN and GAT.
  • Figure 4: Trade-off between mIoU and number of clicks.
  • Figure 5: Analysis of intermediate variables in the GAT model. PS represents the scores of tokens similar to positive clicks, NS represents the scores of tokens similar to negative clicks, PA represents attention calculated from tokens filtered by PS, and NA represents attention calculated from tokens filtered by NS. "Shallow" denotes shallow features, while "deep" denotes deep features.
  • ...and 1 more figure
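
The "Top-k selection" rule in the Figure 3 caption (keep tokens scoring above 0.95, capped at 32) can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `select_nodes` is hypothetical, and cosine similarity to the clicked token is assumed as the score, since the caption does not say how scores are computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def select_nodes(tokens, click_index, threshold=0.95, max_nodes=32):
    """Return indices of tokens similar to the clicked token.

    Tokens are kept only if their score exceeds `threshold`; the survivors
    are ranked by score and truncated to `max_nodes`. The clicked token
    itself scores 1.0 and is therefore always included.
    """
    query = tokens[click_index]
    scored = [(cosine(query, t), i) for i, t in enumerate(tokens)]
    kept = [(s, i) for s, i in scored if s > threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [i for _, i in kept[:max_nodes]]


# Four 2-D toy tokens; the user clicks on token 0.
tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [1.0, 0.5]]
print(select_nodes(tokens, 0))  # [0, 1]
```

Because the threshold is applied before the cap, a click in a region with few similar tokens yields far fewer than 32 nodes, which matches the behaviour the caption describes.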