SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Qifeng Chen, Zhaoxiang Zhang

TL;DR

SimCMF tackles the challenge of transferring vision foundation models trained on RGB imagery to imaging modalities with limited data by introducing a simple cross-modal alignment module paired with a pretrained backbone such as SAM. It systematically analyzes alignment designs and fine-tuning strategies, showing that a frozen pretrained embedding layer plus a small nonlinear cross-modal adapter can bridge the modality gap effectively. Evaluated on the newly built AIMS benchmark, the approach yields substantial gains in segmentation performance (average mIoU reaching 53.88%) over training from scratch and competing baselines, while parameter-efficient fine-tuning (e.g., LoRA, MLP Adapter) achieves results similar to full fine-tuning at far lower cost. These results suggest that vision foundation models can be repurposed for diverse sensors and modalities, extending foundation-model capabilities to domains with scarce data.

Abstract

Foundation models like ChatGPT and Sora, which are trained on data at a huge scale, have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect data at a scale similar to that of natural images to train strong foundation models. To this end, this work presents a simple and effective framework, SimCMF, to study an important problem: cross-modal fine-tuning from vision foundation models trained on natural RGB images to other imaging modalities with different physical properties (e.g., polarization). In SimCMF, we conduct a thorough analysis of different basic components, starting from the most naive design, and ultimately propose a novel cross-modal alignment module to address the modality misalignment problem. We apply SimCMF to a representative vision foundation model, the Segment Anything Model (SAM), to support any evaluated new imaging modality. Given the absence of relevant benchmarks, we construct a benchmark for performance evaluation. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance other sensors' performance. SimCMF improves the segmentation performance (mIoU) from 22.15% to 53.88% on average for the evaluated modalities and consistently outperforms other baselines. The code is available at https://github.com/mt-cly/SimCMF
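The headline numbers above are reported as mIoU. As a rough illustration only, the following sketch computes one common variant of the metric: the intersection-over-union of each predicted binary mask against its ground-truth mask, averaged over all pairs. The exact evaluation protocol used by the paper is not stated in this summary, so the function names (`mask_iou`, `mean_iou`) and the handling of empty masks are assumptions.

```python
import numpy as np

def mask_iou(pred, target):
    """IoU between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; scoring this as 1.0 is a convention
        # chosen for this sketch, not something taken from the paper.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)

def mean_iou(preds, targets):
    """Average IoU over paired predicted and ground-truth masks."""
    return float(np.mean([mask_iou(p, t) for p, t in zip(preds, targets)]))

# Toy example: two 2x2 mask pairs.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
score = mean_iou(preds, targets)  # first pair: 1/2, second pair: 1/1
print(f"mIoU = {score:.2%}")
```

A class-wise definition (IoU per semantic class, averaged over classes) is also widely used; which one applies here depends on the benchmark's protocol.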

Paper Structure

This paper contains 20 sections, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Transferability Across Modalities. (a) The number of natural images is significantly larger than the number of images in other modalities across different areas, including medical imaging, thermal images, depth images, and polarization images. (b) Natural images can be used to train vision foundation models, which achieve strong performance on different downstream tasks. (c) It is very challenging for other modalities to benefit from foundation-model training because of limited data. (d) Our proposed SimCMF explores the transferability of the pretrained vision foundation model to different imaging modalities.
  • Figure 2: SimCMF Conceptual Overview. SimCMF receives a new modality $\mathbf{x}$ as input and passes it through a cross-modal alignment module to obtain an embedding. The embedding matches the dimension of a pretrained foundation model backbone, from which we then obtain the output $\mathbf{y}$. The input and the foundation model are formulated generically to cover different input modalities and foundation models. In this work, we select SAM as a representative foundation model for a detailed study.
  • Figure 3: Qualitative Results. We transfer the segment-anything ability of SAM to different modalities, including segmentation from depth, thermal, polarization, HHA, and NIR images. The proposed method significantly improves segmentation quality compared to SAM zero-shot and training from scratch.
  • Figure 4: Exploring the Cross-modal Alignment Module. Randomly initializing a patch embedding for each modality leads to the worst result. Adding a simple linear layer to the pretrained embedding layer already improves performance. Interestingly, the results are better when the embedding layer is frozen. Introducing nonlinearity further benefits transfer performance. All models are trained with a parameter-efficient fine-tuning strategy. The experiments here are conducted on polarization datasets, and we also validate the effectiveness of these designs on other modalities.
  • Figure 5: The Effect of Learning Rate and Training Data Size. The models are evaluated on the polarization modality. (a) Full fine-tuning and parameter-efficient tuning achieve peak performance at different learning rates. (b) Increasing the scale of training data brings consistent performance improvement across different training strategies.
  • ...and 8 more figures
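
The design described in the captions for Figures 2 and 4 (a small nonlinear alignment module combined with a frozen pretrained patch embedding) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the adapter is assumed to be a per-pixel two-layer MLP placed before the patch embedding, and the hidden width, the ReLU activation, the 9-channel input, the 16x16 patch size, the 768-dimensional embedding, and the random weights standing in for pretrained ones are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_alignment(in_channels, hidden=16, out_channels=3):
    """Trainable parameters of a per-pixel MLP (hypothetical sizes)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_channels, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, out_channels)),
        "b2": np.zeros(out_channels),
    }

def align(x, p):
    """Map a new-modality image (H, W, C) to an RGB-like tensor (H, W, 3)."""
    h = np.maximum(x @ p["w1"] + p["b1"], 0.0)  # ReLU nonlinearity (assumed)
    return h @ p["w2"] + p["b2"]

def patch_embed(img, weight, bias, patch=16):
    """Frozen patch embedding: cut (H, W, 3) into non-overlapping patches
    and project each flattened patch with a fixed linear map."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    patches = img[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
    return patches @ weight + bias

# Hypothetical 9-channel input (for example, stacked polarization measurements).
x = rng.normal(size=(64, 64, 9))
params = init_alignment(in_channels=9)

# Random stand-ins for the pretrained embedding weights; in a real setup
# these would be loaded from the foundation model and kept frozen.
embed_w = rng.normal(0.0, 0.02, (16 * 16 * 3, 768))
embed_b = np.zeros(768)

aligned = align(x, params)
tokens = patch_embed(aligned, embed_w, embed_b)
print(aligned.shape, tokens.shape)
```

In this sketch only `params` would be updated during fine-tuning, while `embed_w` and `embed_b` stay fixed, which mirrors the frozen-embedding variant that the Figure 4 caption reports as working better.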