TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models

Rabin Adhikari, Safal Thapaliya, Manish Dhakal, Bishesh Khanal

TL;DR

TuneVLSeg is an open-source benchmarking framework that integrates various unimodal and multimodal prompt tuning techniques into Vision-Language Segmentation Models (VLSMs), making prompt tuning usable for downstream segmentation datasets with any number of classes and advancing the understanding and applicability of different prompt-tuning techniques for robust domain-specific segmentation.

Abstract

Vision-Language Models (VLMs) have shown impressive performance in vision tasks, but adapting them to new domains often requires expensive fine-tuning. Prompt tuning techniques, including textual, visual, and multimodal prompting, offer efficient alternatives by leveraging learnable prompts. However, their application to Vision-Language Segmentation Models (VLSMs) and their evaluation under significant domain shifts remain unexplored. This work presents an open-source benchmarking framework, TuneVLSeg, that integrates various unimodal and multimodal prompt tuning techniques into VLSMs, making prompt tuning usable for downstream segmentation datasets with any number of classes. TuneVLSeg includes $6$ prompt tuning strategies at various prompt depths, applied to $2$ VLSMs, for a total of $8$ different combinations. We test these prompt tuning strategies on $8$ diverse medical datasets, including $3$ radiology datasets (breast tumor, echocardiograph, chest X-ray pathologies) and $5$ non-radiology datasets (polyp, ulcer, skin cancer), and on $2$ natural-domain segmentation datasets. Our study found that textual prompt tuning struggles under significant domain shifts, such as the shift from natural-domain images to medical data. Furthermore, visual prompt tuning, which has fewer hyperparameters than multimodal prompt tuning, often achieves performance competitive with multimodal approaches, making it a valuable first attempt. Our work advances the understanding and applicability of different prompt-tuning techniques for robust domain-specific segmentation. The source code is available at https://github.com/naamiinepal/tunevlseg.
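To make the notions of "learnable prompts" and "prompt depth" concrete, the following is a minimal pure-Python sketch of deep prompt injection into a stack of frozen layers. It is a generic illustration, not the TuneVLSeg implementation: the function name `forward_with_prompts`, the list-of-lists representation of embeddings, and the choice to replace the previous prompt outputs at each prompted layer are all assumptions made for this example.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]
Layer = Callable[[Sequence], Sequence]


def forward_with_prompts(
    tokens: Sequence,
    layers: List[Layer],
    prompts_per_layer: List[Sequence],
) -> Sequence:
    """Run frozen layers, injecting prompts into the first `depth` layers.

    Hypothetical helper: `depth` is len(prompts_per_layer), and every entry
    is assumed to hold the same number of prompt vectors. In a real setup
    only these prompt vectors would be trained; the layers stay frozen.
    """
    depth = len(prompts_per_layer)
    if depth > len(layers):
        raise ValueError("prompt depth cannot exceed the number of layers")
    n_prompts = len(prompts_per_layer[0]) if depth else 0

    hidden = list(tokens)
    for i, layer in enumerate(layers):
        if i < depth:
            if i > 0:
                # Drop the outputs produced by the previous layer's prompts.
                hidden = hidden[n_prompts:]
            # Prepend this layer's own prompt vectors.
            hidden = prompts_per_layer[i] + hidden
        hidden = layer(hidden)
    return hidden


def identity(seq: Sequence) -> Sequence:
    """Stand-in for a frozen transformer block (assumption for the demo)."""
    return [list(v) for v in seq]


tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
prompts = [
    [[0.1, 0.1], [0.2, 0.2]],  # prompts for layer 0
    [[0.3, 0.3], [0.4, 0.4]],  # prompts for layer 1
]
out = forward_with_prompts(tokens, [identity] * 3, prompts)
print(len(out))
```

With a prompt depth of 1 this reduces to shallow prompting, where prompts are added only at the input. Larger depths give each of the first few blocks its own prompt vectors.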

Paper Structure (27 sections, 12 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Manual prompting in CLIP and CLIP-based segmentation models.
  • Figure 2: Overview of various prompt tuning methods. The first row shows the unimodal prompt tuning methods, and the second row shows the multimodal prompt tuning methods. Prompting is shown for the first layer only; the same concept applies when prompt tuning is done for multiple transformer blocks.
  • Figure 3: Multimodal Prompt Tuning Architecture. For simplicity, the projection layers that condition the prompts of one modality on the other are not shown. For unimodal techniques, only one of the prompt modalities is fed into the model.
  • Figure 4: t-SNE [van2008visualizing] plots of phrases and images of all the datasets. Here, PhraseCut [wu2020phrasecut] is the dataset on which CLIPSeg was pretrained, Cityscapes [cordts2016cityscapes] and Pascal VOC2012 [everingham2012pascal] are the open-domain datasets, while the others correspond to the medical domain. In the image t-SNE plot, the clusters of the open-domain datasets overlap, whereas the medical datasets form separate small clusters.
  • Figure 5: Test Dice vs. Prompt Depth for Textual Tuning of all Datasets
  • ...and 2 more figures