SkinFormer: Learning Statistical Texture Representation with Transformer for Skin Lesion Segmentation

Rongtao Xu, Changwei Wang, Jiguang Zhang, Shibiao Xu, Weiliang Meng, Xiaopeng Zhang

TL;DR

A transformer network, SkinFormer, is proposed that efficiently extracts and fuses statistical texture representations for skin lesion segmentation and enhances the statistical texture of multi-scale features.

Abstract

Accurate skin lesion segmentation from dermoscopic images is of great importance for skin cancer diagnosis. However, automatic segmentation of melanoma remains challenging because it is difficult to incorporate useful texture representations into the learning process. Texture representations are related not only to the local structural information learned by CNNs, but also to the global statistical texture information of the input image. In this paper, we propose a transformer network (SkinFormer) that efficiently extracts and fuses statistical texture representations for skin lesion segmentation. Specifically, to quantify the statistical texture of input features, we design a Kurtosis-guided Statistical Counting Operator. Building on this operator and the transformer's global attention mechanism, we propose a Statistical Texture Fusion Transformer and a Statistical Texture Enhance Transformer. The former fuses structural texture information and statistical texture information, and the latter enhances the statistical texture of multi-scale features. Extensive experiments on three publicly available skin lesion datasets validate that SkinFormer outperforms other state-of-the-art (SOTA) methods, achieving a 93.2% Dice score on ISIC 2018. SkinFormer could be readily extended to segment 3D images in the future. Our code is available at https://github.com/Rongtao-Xu/SkinFormer.
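The abstract reports results as a Dice score. For readers unfamiliar with the metric, the following is a minimal sketch of the standard Dice coefficient on binary masks, 2|A ∩ B| / (|A| + |B|). The function name `dice_score` and the toy masks are illustrative assumptions; this is not the evaluation code from the SkinFormer repository.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Both masks are empty: treat as perfect agreement (a convention, assumed here).
        return 1.0
    return float(2.0 * intersection / total)


# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 of them overlapping.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_score(pred, target)
```

Here the overlap is 2 pixels out of 3 + 3, so the score is 4/6, roughly 0.667.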

Paper Structure (21 sections, 16 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Examples of structural and statistical textures and our results. (a) shows the original image. (b) shows the structural texture extracted by a typical convolutional neural network [long2015fully]. (c) displays the histogram (statistical texture information) of the original image. (d) shows our SkinFormer's results. The red contours show the predicted results and the green contours show the ground truth.
  • Figure 2: Overview of our statistical texture transformer network (SkinFormer) with the Kurtosis-guided Statistical Counting Operator, Statistical Texture Fusion Transformer and Statistical Texture Enhance Transformer. We adopt a U-shaped network structure to extract multi-level features. The high-level features are then fed into the Statistical Texture Fusion Transformer to obtain the statistical texture fusion feature $O^{fusion}$ and the quantized intensity embedding $S$. The multi-scale low-level features and $S$ are fed into the Statistical Texture Enhance Transformer to enhance the ability of texture representation extraction. Both transformer-based modules use comprehensive attention to improve the representation ability of features.
  • Figure 3: Illustration of the Kurtosis-guided Statistical Counting Operator. Adjusting the channels of the input feature map $F$ produces the feature aggregation map $F^{a}$, which is then reshaped to size $\mathbb{R}^{(H\times W)}$. We quantize the reshaped features into $N$ levels $(W_{1}, W_{2}, ..., W_{N})$. We use the kurtosis of $F^{a}$ as weights and apply $\varphi (\cdot)$ to obtain the quantized intensity embedding $S$.
  • Figure 4: Illustration of the Statistical Texture Fusion Transformer. It includes the Kurtosis-guided Statistical Counting Operator, comprehensive attention, local-window self-attention and gating mechanisms, and is designed to efficiently fuse the structural and statistical texture information of images.
  • Figure 5: Illustration of the Statistical Texture Enhance Transformer. It is divided into two main parts: (a) multi-scale embedding enhancement and (b) texture-enhanced FFN. The former obtains the inputs (Q, K, V) of local-window self-attention by enhancing the statistical texture of shallow features and high-level features at different scales, where the statistical texture of high-level features is further enhanced by comprehensive attention. The latter enlarges the receptive field with dilated convolution, aiming to enhance the ability of texture representation extraction.
  • ...and 3 more figures
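
The Figure 3 caption describes the counting operator only at a high level, so the following is a rough sketch of that pipeline, not the authors' implementation. Several choices are assumptions made for illustration: channel aggregation is a plain mean (the paper adjusts channels in an unspecified way), the $N$ levels are uniformly spaced between the minimum and maximum of $F^{a}$, each value is softly assigned to its two nearest levels, the kurtosis is a single scalar that scales the counts, and the mapping $\varphi (\cdot)$ is omitted. The names `kurtosis` and `statistical_counting` are hypothetical.

```python
import numpy as np


def kurtosis(x):
    """Pearson kurtosis m4 / m2**2 of a 1-D array (about 3 for Gaussian data)."""
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return float(m4 / (m2 ** 2 + 1e-12))  # small constant guards against m2 == 0


def statistical_counting(feature, num_levels=8):
    """Quantize a (C, H, W) feature map into num_levels soft histogram bins.

    Returns the level values and the kurtosis-weighted normalized counts.
    """
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    # Assumed aggregation: mean over channels, then flatten to length H*W.
    aggregated = feature.mean(axis=0).reshape(-1)
    lo, hi = aggregated.min(), aggregated.max()
    levels = np.linspace(lo, hi, num_levels)
    step = (hi - lo) / (num_levels - 1) if hi > lo else 1.0
    # Soft assignment: triangular weight that decays linearly with distance to a level.
    dist = np.abs(aggregated[:, None] - levels[None, :])
    assign = np.clip(1.0 - dist / step, 0.0, None)  # shape (H*W, N)
    counts = assign.sum(axis=0) / assign.sum()      # normalized soft histogram
    # Assumed weighting: scale the whole histogram by the kurtosis of F^a.
    return levels, kurtosis(aggregated) * counts


rng = np.random.default_rng(0)
feature = rng.normal(size=(4, 16, 16))  # toy feature map: C=4, H=W=16
levels, weighted_counts = statistical_counting(feature, num_levels=8)
```

The soft assignment keeps the counting step differentiable in a deep-learning setting, which a hard histogram would not be; whether SkinFormer uses this exact form should be checked against the paper's equations and the released code.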