Learning Modality Knowledge Alignment for Cross-Modality Transfer
Wenxuan Ma, Shuang Li, Lincan Cai, Jingxuan Kang
TL;DR
This work explains why cross-modality transfer often underperforms by formalizing the modality semantic knowledge discrepancy as a misalignment between the conditional distributions $P(Y|X)$ of the source and target modalities. It introduces MoNA, a bi-level meta-learning method that learns a target embedder to transform target data into a shared space aligned with the source modality before full finetuning. The approach is validated on NAS-Bench-360, PDEBench, and additional tasks, showing consistent improvements over standard finetuning and existing cross-modal methods, with ablations confirming the importance of the embedder warmup and the two-stage objective. MoNA thus provides a principled framework for improving cross-modality transfer across diverse domains.
Abstract
Cross-modality transfer aims to leverage large pretrained models to complete tasks that may not belong to the modality of the pretraining data. Existing works have had some success in extending classical finetuning to cross-modal scenarios, yet we still lack an understanding of how the modality gap influences the transfer. In this work, we conduct a series of experiments on source representation quality during transfer, revealing that a larger modality gap is associated with less knowledge reuse, and hence less effective transfer. We then formalize the gap as the knowledge misalignment between modalities using the conditional distribution $P(Y|X)$. To address this problem, we present Modality kNowledge Alignment (MoNA), a meta-learning approach that learns a target data transformation to reduce the modality knowledge discrepancy ahead of the transfer. Experiments show that our method enables better reuse of source modality knowledge in cross-modality transfer, which leads to improvements over existing finetuning methods.
