Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation

Lincan Cai; Shuang Li; Wenxuan Ma; Jingxuan Kang; Binhui Xie; Zixun Sun; Chengwei Zhu

TL;DR

PaRe tackles cross-modal fine-tuning under large modality gaps and limited target data. It introduces Gradually Intermediate Modality Generation: a gate network scores patches, and a modality-agnostic patch replacement scheme mixes source and target patches. The method constructs a curriculum of intermediate modalities, whose distance to the source and target modalities is measured by OTDD, and trains from source-like to target-like data to improve transferability and stability. Across NAS-Bench-360, PDEBench, and OpenML-CC18, PaRe consistently outperforms ORCA and other baselines, establishing new state-of-the-art results on several tasks. The work demonstrates a scalable, end-to-end strategy for adapting pretrained models to diverse, data-scarce modalities such as PDEs, protein structures, and cosmic rays, with avenues for further enhancement via proxy selection and unlabeled data.

Abstract

Large-scale pretrained models have proven immensely valuable in handling data-intensive modalities such as text and images. However, fine-tuning these models for certain specialized modalities, such as protein sequences and cosmic rays, is challenging because of the significant modality discrepancy and the scarcity of labeled data. In this paper, we propose an end-to-end method, PaRe, to enhance cross-modal fine-tuning, aiming to transfer a large-scale pretrained model to various target modalities. PaRe employs a gating mechanism to select key patches from both source and target data. Through a modality-agnostic Patch Replacement scheme, these patches are preserved and combined to construct data-rich intermediate modalities ranging from easy to hard. By generating intermediate modalities gradually, PaRe not only bridges the modality gap, which enhances the stability and transferability of cross-modal fine-tuning, but also addresses the limited data in the target modality by leveraging the enriched intermediate modality data. Compared with hand-designed, general-purpose, task-specific, and state-of-the-art cross-modal fine-tuning approaches, PaRe demonstrates superior performance across three challenging benchmarks encompassing more than ten modalities.
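The gating and patch replacement steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `gate_scores`, `patch_replace`, and `k_schedule`, the array shapes, and the linear decay of $k$ are all assumptions made for illustration. The sketch only assumes that a gate assigns each patch a score in (0, 1), that the $k$ highest-scoring source patches overwrite the $k$ lowest-scoring target patches, and that $k$ shrinks over training so the mixed samples move from source-like to target-like.

```python
import numpy as np

def gate_scores(patches, weight, bias):
    """Score patches with one linear layer followed by a sigmoid.

    patches: (num_patches, dim), weight: (dim,), bias: scalar.
    The parameters would be learned in practice; here they are plain inputs.
    """
    logits = patches @ weight + bias
    return 1.0 / (1.0 + np.exp(-logits))

def patch_replace(source, target, src_scores, tgt_scores, k):
    """Overwrite the k lowest-scoring target patches with the
    k highest-scoring source patches and return the mixed sample."""
    if not 0 <= k <= min(len(source), len(target)):
        raise ValueError("k must lie between 0 and the number of patches")
    mixed = target.copy()
    if k == 0:
        return mixed
    top_src = np.argsort(src_scores)[::-1][:k]  # k highest source scores
    bottom_tgt = np.argsort(tgt_scores)[:k]     # k lowest target scores
    mixed[bottom_tgt] = source[top_src]
    return mixed

def k_schedule(k_init, num_stages):
    """Hypothetical linear schedule: start with k_init replaced patches
    (source-like) and decay to 0 (pure target data)."""
    return np.linspace(k_init, 0, num_stages).round().astype(int).tolist()
```

Under these assumptions, calling `patch_replace` once per stage of `k_schedule(k_init, num_stages)` yields a sequence of intermediate samples that contain progressively fewer source patches.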

Paper Structure (43 sections, 2 equations, 8 figures, 18 tables, 1 algorithm)

Introduction
Related Work
Multi-modal transformers.
In-modality fine-tuning.
Cross-modality fine-tuning.
Curriculum learning.
Cross-modality mixing.
Method
Problem setup.
Architecture design
Source and target embedder.
Custom predictor.
Gradually intermediate modality generation
Modality-agnostic patch replacement
Experiments
...and 28 more sections

Figures (8)

Figure 1: (a) The loss landscapes of models fine-tuned with ORCA [orca] and PaRe on the Ninapro dataset. (b) The OTDD [otdd] between the intermediate modality, for different values of $k$, and the source or target modality. (c) Target embeddings (black dots), intermediate modality embeddings obtained by replacing target patches with different numbers of source patches (blue and green dots), and source embeddings (red dots), visualized using t-SNE. Intermediate modalities effectively bridge the modality gap and enhance the model's transferability and stability.
Figure 2: Framework overview. a) The overall architecture of the model and the workflow of our method. b) The Patch Replacement (PaRe) module comprises three steps: scoring patches with the gate network, gradually selecting the top-$k$ source patches and bottom-$k$ target patches, and replacing the selected target patches with the source patches one by one. c) The architecture of the gate network, which consists of a Fully-Connected (FC) layer and a Sigmoid layer.
Figure 3: Aggregating the results of Table \ref{tab:nas360} using performance profiles [dolan2002profiles]. The ordinate represents the cumulative distribution of problems solved by a method within a factor $\tau$ of the best performance. Therefore, the closer a curve lies to the top-left corner of the graph, the more problems the method solves with minimal performance degradation. PaRe's curve is a horizontal line, which means it is always the best method.
Figure 4: Visualization of different numbers of patches selected by the random strategy and by our gate strategy. Additional visualizations can be found in Appendix \ref{appendix:pase}.
Figure 5: The impact of the random and gate strategies on the results on the Cosmic dataset for different initial $k$ values. The smaller the initial $k$ value, the larger the percentage performance difference between the random and gate strategies.
...and 3 more figures
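
The performance profiles mentioned in the Figure 3 caption can be computed as in the sketch below. It follows the standard ratio-to-best definition and assumes a lower-is-better, strictly positive cost for every method on every task. The function name `performance_profile` and the input layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def performance_profile(costs, taus):
    """Fraction of tasks on which each method is within a factor tau of the best.

    costs: (num_tasks, num_methods), lower is better, strictly positive.
    taus:  1-D sequence of factors, each >= 1.
    Returns an array of shape (num_taus, num_methods).
    """
    costs = np.asarray(costs, dtype=float)
    taus = np.asarray(taus, dtype=float)
    best = costs.min(axis=1, keepdims=True)  # best cost on each task
    ratios = costs / best                    # performance ratio, always >= 1
    # Compare every ratio against every tau, then average over tasks.
    within = ratios[None, :, :] <= taus[:, None, None]
    return within.mean(axis=0 + 1)
```

A method that is best on every task has ratio 1 everywhere, so its profile equals 1 for all $\tau \geq 1$, which is the horizontal line the caption describes.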
