SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality

Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang

TL;DR

SimMAT improves segmentation performance (mIoU) from 22.15% to 53.88% on average across the evaluated modalities and consistently outperforms other baselines. The work aims to raise awareness of cross-modal transfer learning and to help other fields obtain better results with vision foundation models.

Abstract

Foundation models like ChatGPT and Sora, which are trained on data at a huge scale, have had a revolutionary social impact. However, it is extremely challenging for sensors in many other fields to collect data at a scale comparable to natural images in order to train strong foundation models. To this end, this work presents a simple and effective framework, SimMAT, to study an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model, the Segment Anything Model (SAM), to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance the performance of other sensors. Specifically, SimMAT improves the segmentation performance (mIoU) from 22.15% to 53.88% on average across the evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and help various fields obtain better results with vision foundation models.
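
The headline numbers above are reported as mIoU. As a rough illustration of what that metric measures, the sketch below computes intersection-over-union for binary masks and averages it over mask pairs. The function names `mask_iou` and `mean_iou`, the nested-list mask format, and the convention for two empty masks are assumptions made for this example; the exact evaluation protocol of the paper is not described in this summary.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (illustrative format, not the paper's).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks with the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average of per-mask IoU over paired predictions and ground truths."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and equal in length")
    return sum(mask_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


# Two toy 2x2 mask pairs: the first overlaps on 1 of 2 pixels, the second matches.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 1]]]
print(mean_iou(preds, targets))  # prints 0.75
```

On this scale, the reported average improvement corresponds to moving from 0.2215 to 0.5388.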

Paper Structure (5 sections, 1 equation, 13 figures, 4 tables)

This paper contains 5 sections, 1 equation, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Transferability Across Modalities. a, the number of natural images is significantly larger than the number of images in other modalities across different areas, including medical imaging, thermal images, depth images, and polarization images. b, natural images can train vision foundation models, which can then be applied to achieve strong performance on different downstream tasks. c, it is very challenging for other modalities to benefit from training foundation models due to limited data. d, our proposed SimMAT explores the transferability from the pretrained vision foundation model to different modalities.
  • Figure 2: Qualitative Results. We transfer the segment-anything ability of SAM to different modalities, including segmentation from depth, thermal, polarization, HHA, and NIR images. The proposed method significantly improves segmentation quality compared to SAM zero-shot and training from scratch.
  • Figure 3: Details of SimMAT. a. SimMAT receives a new modality $\mathbf{x}$ as input and passes it through a modality-agnostic transfer layer $m$ to obtain an embedding $\mathbf{e}$. The embedding matches the dimension of a pretrained foundation model $f$, which then produces the output $\mathbf{y}$. The input and the foundation model are formulated generically to cover different modalities and foundation models. b. In this work, we select SAM as a representative foundation model for a detailed study.
  • Figure 4: Performance Evaluation on Different Modalities. a. The proposed method SimMAT significantly improves segmentation performance on all evaluated modalities compared with training the models from scratch. Specifically, SimMAT improves the mIoU from 22.15% to 53.88% on average across all evaluated modalities. In addition, full finetuning and parameter-efficient finetuning reach similar peak performance. b. Results on Pseudo New Modalities. We combine natural images with a novel image modality to form a pseudo new modality; note that we do not use any information about which three channels belong to the natural image and which channels belong to the new modality. Our MAT is effective in improving the finetuning performance on all evaluated pseudo new modalities. In addition, full finetuning and parameter-efficient finetuning reach similar peak performance. c. We provide controlled experiments for different finetuning strategies on new modalities. Parameter-efficient finetuning strategies achieve performance comparable to full finetuning while using far fewer trainable parameters.
  • Figure 5: The Effect of Learning Rate and Training Data Size. The models are evaluated on the polarization modality. a. Full finetuning and parameter-efficient finetuning reach peak performance at different learning rates. b. Increasing the scale of the training data brings consistent performance improvement across different training strategies.
  • ...and 8 more figures
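
The generic formulation in the Figure 3 caption, where an input $\mathbf{x}$ goes through the transfer layer $m$ to give an embedding $\mathbf{e}$ and the foundation model $f$ then produces $\mathbf{y}$, can be sketched as a simple composition $\mathbf{y} = f(m(\mathbf{x}))$. The code below is only a toy illustration under stated assumptions: `PixelwiseLinearMAT` is a hypothetical per-pixel linear map standing in for $m$, `toy_foundation_model` is a placeholder for a pretrained model that expects 3-channel input, and the 4-channel input is invented for the example. None of these reflect the actual MAT design or SAM.

```python
from typing import Callable, List

# An image as nested lists with shape H x W x C (illustrative format).
Image = List[List[List[float]]]
MaskGrid = List[List[int]]


class PixelwiseLinearMAT:
    """Hypothetical stand-in for the transfer layer m.

    Maps each pixel from C input channels to D embedding channels with a
    fixed linear map. The real MAT layer is not specified in this summary.
    """

    def __init__(self, weights: List[List[float]]):
        # weights[d][c] is the contribution of input channel c to output channel d.
        self.weights = weights
        self.in_channels = len(weights[0])

    def __call__(self, x: Image) -> Image:
        out = []
        for row in x:
            out_row = []
            for pixel in row:
                if len(pixel) != self.in_channels:
                    raise ValueError("unexpected number of input channels")
                out_row.append(
                    [sum(w * v for w, v in zip(w_row, pixel)) for w_row in self.weights]
                )
            out.append(out_row)
        return out


def toy_foundation_model(e: Image) -> MaskGrid:
    """Placeholder for a pretrained model f that expects 3-channel input.

    It thresholds the channel mean to produce a binary mask; a real
    foundation model such as SAM is far more complex.
    """
    mask = []
    for row in e:
        mask_row = []
        for pixel in row:
            if len(pixel) != 3:
                raise ValueError("foundation model expects 3 channels")
            mask_row.append(1 if sum(pixel) / 3 > 0.5 else 0)
        mask.append(mask_row)
    return mask


def simmat_forward(x: Image, m: Callable[[Image], Image],
                   f: Callable[[Image], MaskGrid]) -> MaskGrid:
    """Compose the two stages: e = m(x), then y = f(e)."""
    e = m(x)
    return f(e)


# A 1x2 image with 4 channels (a made-up new modality); each of the 3 output
# channels is the mean of the 4 input channels.
mat = PixelwiseLinearMAT([[0.25, 0.25, 0.25, 0.25] for _ in range(3)])
x = [[[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]]
print(simmat_forward(x, mat, toy_foundation_model))  # prints [[1, 0]]
```

The point of the composition is that only the transfer layer has to know the channel count of the new modality, so the same pretrained model can in principle be reused across modalities.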