AViT: Adapting Vision Transformers for Small Skin Lesion Segmentation Datasets

Siyi Du, Nourhan Bayasi, Ghassan Hamarneh, Rafeef Garbi

TL;DR

Skin lesion segmentation benefits from Vision Transformers but suffers from data scarcity in small datasets. AViT addresses this by freezing a pre-trained ViT backbone while inserting adapters in transformer layers and adding a shallow CNN prompt generator to guide segmentation, resulting in only about 13.6M trainable parameters (≈13.7% of the total). On four public datasets, AViT delivers competitive or superior performance to state-of-the-art SLS methods and many PEFT baselines with markedly reduced trainable parameters and memory usage, demonstrating effective knowledge transfer with minimal adaptation. This approach provides a scalable, efficient route to deploying ViT-based segmentation in data-limited medical imaging settings, with code available for reproduction.
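As a quick sanity check on the figures above, the trainable count and its share together imply the size of the whole model. The short sketch below does only that arithmetic; the helper name `implied_totals` is made up for illustration, and the inputs are the rounded values quoted in this summary, so the results are approximate.

```python
def implied_totals(trainable_m, trainable_share):
    """Return (total, frozen) parameter counts in millions implied by a
    trainable count (in millions) and its share of the whole model."""
    total_m = trainable_m / trainable_share
    return total_m, total_m - trainable_m


# Rounded values quoted in the TL;DR: about 13.6M trainable, about 13.7% of the total.
total_m, frozen_m = implied_totals(13.6, 0.137)
frozen_share = frozen_m / total_m

print(f"total ~ {total_m:.1f}M, frozen ~ {frozen_m:.1f}M, frozen share ~ {frozen_share:.1%}")
```

The implied frozen share matches the 86.3% of parameters that the Figure 1 caption reports as frozen.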

Abstract

Skin lesion segmentation (SLS) plays an important role in skin lesion analysis. Vision transformers (ViTs) are considered a promising solution for SLS, but they require more training data than convolutional neural networks (CNNs) because of their parameter-heavy structure and lack of some inductive biases. To alleviate this issue, current approaches fine-tune pre-trained ViT backbones on SLS datasets, aiming to leverage the knowledge learned from a larger set of natural images to reduce the amount of skin training data needed. However, fully fine-tuning all parameters of large backbones is computationally expensive and memory intensive. In this paper, we propose AViT, a novel efficient strategy that mitigates ViTs' data hunger by transferring any pre-trained ViT to the SLS task. Specifically, we integrate lightweight modules (adapters) within the transformer layers, which modulate the feature representation of a ViT without updating its pre-trained weights. In addition, we employ a shallow CNN as a prompt generator to create a prompt embedding from the input image, which captures fine-grained information and the CNN's inductive biases to guide the segmentation task on small datasets. Our quantitative experiments on 4 skin lesion datasets demonstrate that AViT achieves competitive, and at times superior, performance to state-of-the-art (SOTA) methods but with significantly fewer trainable parameters. Our code is available at https://github.com/siyi-wind/AViT.
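To make the adapter idea concrete, here is a minimal pure-Python sketch of a bottleneck adapter applied to a single token vector. It is not taken from the AViT repository. The down-project, GELU, up-project, residual layout is a common adapter design assumed here for illustration, and the bottleneck width of 64 and the zero-initialised up-projection are likewise assumptions. The token width of 768 is the ViT-Base hidden size.

```python
import math
import random


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Compute W x + b for one token; `weight` is a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


class Adapter:
    """Hypothetical bottleneck adapter: down-project, GELU, up-project, residual add."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Assumption: zero-initialised up-projection, so the adapter starts as an
        # identity mapping and the frozen backbone's features pass through unchanged.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        self.b_up = [0.0] * dim

    def num_params(self):
        """Trainable parameters: two weight matrices plus two bias vectors."""
        dim, r = len(self.w_up), len(self.w_down)
        return 2 * dim * r + dim + r

    def __call__(self, token):
        hidden = [gelu(h) for h in linear(token, self.w_down, self.b_down)]
        delta = linear(hidden, self.w_up, self.b_up)
        return [t + d for t, d in zip(token, delta)]


adapter = Adapter(dim=768, bottleneck=64)  # 64 is an illustrative width
token = [0.1] * 768                        # stand-in for one frozen-backbone feature vector
out = adapter(token)

print(adapter.num_params())  # trainable parameters in this one adapter
print(out == token)          # identity at initialisation under the zero-init assumption
```

Only the adapter's own weights would be updated during training. The backbone weights that produced `token` stay fixed, which is what keeps the trainable parameter count small.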

Paper Structure

This paper contains 6 sections, 4 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Architecture of AViT: (a) Model overview with its prompt generator (a shallow CNN), a large pre-trained ViT backbone with adapters, and a compact decoder. (b) Model details. (c) Details of a transformer layer with adapters. (d) Details of our adapters. During training, all modules in (b,c,d) outlined in blue are frozen; these account for 86.3% of AViT's parameters.
  • Figure 2: Visual comparison with different SOTA methods. The green contours are the ground truth, and the red contours are the segmentation results.