Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation
Lincan Cai, Shuang Li, Wenxuan Ma, Jingxuan Kang, Binhui Xie, Zixun Sun, Chengwei Zhu
TL;DR
PaRe tackles cross-modal fine-tuning under large modality gaps and limited target data by introducing Gradually Intermediate Modality Generation, a modality-agnostic patch replacement scheme guided by a gate network. The method constructs a curriculum of intermediate modalities, with modality discrepancy measured by the Optimal Transport Dataset Distance (OTDD), and trains from source-like to target-like data to improve transferability and stability. Across NAS-Bench-360, PDEBench, and OpenML-CC18, PaRe consistently outperforms ORCA and other baselines, establishing new state-of-the-art results on several tasks. The work demonstrates a scalable, end-to-end strategy for adapting pretrained models to diverse, data-scarce modalities such as PDEs, protein sequences, and cosmic rays, with avenues for further improvement via proxy selection and unlabeled data.
Abstract
Large-scale pretrained models have proven immensely valuable in handling data-intensive modalities such as text and images. However, fine-tuning these models for certain specialized modalities, such as protein sequences and cosmic rays, is challenging because of the significant modality discrepancy and the scarcity of labeled data. In this paper, we propose an end-to-end method, PaRe, to enhance cross-modal fine-tuning, aiming to transfer a large-scale pretrained model to various target modalities. PaRe employs a gating mechanism to select key patches from both source and target data. Through a modality-agnostic Patch Replacement scheme, these patches are preserved and combined to construct data-rich intermediate modalities ranging from easy to hard. By generating intermediate modalities gradually, we not only bridge the modality gap, which improves the stability and transferability of cross-modal fine-tuning, but also address the limited data in the target modality by leveraging the enriched intermediate modality data. Compared with hand-designed, general-purpose, task-specific, and state-of-the-art cross-modal fine-tuning approaches, PaRe demonstrates superior performance across three challenging benchmarks encompassing more than ten modalities.
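
The patch replacement idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mix_patches` and `ratio_schedule`, the top-k selection rule, and the linear easy-to-hard schedule are all assumptions made for illustration, and the gate scores are taken as given rather than produced by a learned gate network.

```python
import numpy as np

def mix_patches(source, target, source_scores, target_scores, target_ratio):
    """Build one intermediate sample from two patch sequences of equal shape.

    source, target: arrays of shape (num_patches, dim).
    source_scores, target_scores: per-patch gate scores (assumed given here).
    target_ratio: fraction of target patches to keep, in [0, 1].
    """
    n = target.shape[0]
    k = int(round(target_ratio * n))            # number of target patches kept
    keep_tgt = np.argsort(-target_scores)[:k]   # highest-scoring target patches
    mask = np.zeros(n, dtype=bool)
    mask[keep_tgt] = True

    mixed = np.empty_like(target)
    mixed[mask] = target[mask]                  # preserve key target patches in place
    # Fill the remaining slots with the highest-scoring source patches.
    top_src = np.argsort(-source_scores)[: n - k]
    mixed[~mask] = source[top_src]
    return mixed, mask

def ratio_schedule(num_stages):
    """Hypothetical linear curriculum: source-like (0.0) to target-like (1.0)."""
    return np.linspace(0.0, 1.0, num_stages)

# Usage: generate intermediate samples of increasing difficulty.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 4))
tgt = rng.normal(size=(8, 4))
for r in ratio_schedule(5):
    sample, kept = mix_patches(src, tgt, rng.random(8), rng.random(8), r)
```

In this sketch, a ratio of 0 yields a sample built entirely from source patches and a ratio of 1 returns the target sample unchanged, so stepping through the schedule moves training data from the source modality toward the target modality.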
