Learning Modality Knowledge Alignment for Cross-Modality Transfer

Wenxuan Ma; Shuang Li; Lincan Cai; Jingxuan Kang

TL;DR

This work addresses why cross-modality transfer often underperforms by formalizing modality semantic knowledge discrepancy as a misalignment between conditional distributions $P(Y|X)$. It introduces MoNA, a bi-level meta-learning method that learns a target embedder to transform target data into a shared space that aligns with the source modality before full finetuning. The approach is validated on NAS-Bench-360, PDEBench, and additional tasks, showing consistent improvements over standard finetuning and existing cross-modal methods, with ablations confirming the importance of the embedder warmup and the two-stage objective. MoNA thus provides a principled, scalable framework to enhance cross-modality transfer with practical gains across diverse domains.

Abstract

Cross-modality transfer aims to leverage large pretrained models to complete tasks that may not belong to the modality of the pretraining data. Existing works have had some success in extending classical finetuning to cross-modal scenarios, yet we still lack an understanding of how the modality gap influences the transfer. In this work, we conduct a series of experiments focusing on source representation quality during transfer, which reveal that a larger modality gap is associated with less knowledge reuse, that is, ineffective transfer. We then formalize the gap as the knowledge misalignment between modalities using the conditional distribution $P(Y|X)$. To address this problem, we present Modality kNowledge Alignment (MoNA), a meta-learning approach that learns a target data transformation to reduce the modality knowledge discrepancy ahead of the transfer. Experiments show that our method enables better reuse of source modality knowledge in cross-modality transfer, which leads to improvements over existing finetuning methods.
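The bi-level update described in the Figure 5 caption (inner-loop loss on target data, a virtual update of the model, an outer-loop loss on source data, then an update of the embedder only) can be illustrated with a toy scalar sketch. This is not the authors' implementation: the scalar "embedder" `phi`, the scalar "encoder" `theta`, the squared-error losses, the identity source embedder, the learning rates, and the finite-difference gradient are all assumptions chosen so the example runs with the standard library alone.

```python
# Toy bi-level update with scalar parameters (illustrative assumptions only).
# phi:   hypothetical target embedder weight (the only parameter we update)
# theta: hypothetical shared encoder weight (pretrained, fits the source data)

def inner_loss(theta, phi, x_t, y_t):
    """Target-task loss of encoder(embedder(x_t)) under squared error."""
    return (theta * phi * x_t - y_t) ** 2

def virtual_update(theta, phi, x_t, y_t, lr_inner):
    """One gradient step on theta for the inner loss (analytic gradient)."""
    grad_theta = 2.0 * (theta * phi * x_t - y_t) * phi * x_t
    return theta - lr_inner * grad_theta

def outer_loss(theta_star, x_s, y_s):
    """Source-task loss using the virtually updated encoder.
    The source embedder is assumed to be the identity here."""
    return (theta_star * x_s - y_s) ** 2

def meta_objective(phi, theta, target, source, lr_inner):
    """Outer loss as a function of phi, through the virtual inner step."""
    x_t, y_t = target
    x_s, y_s = source
    theta_star = virtual_update(theta, phi, x_t, y_t, lr_inner)
    return outer_loss(theta_star, x_s, y_s)

def embedder_step(phi, theta, target, source,
                  lr_inner=0.1, lr_outer=0.5, eps=1e-6):
    """Update phi with the outer-loop gradient; theta itself is untouched,
    so the virtually updated encoder is discarded after each step."""
    up = meta_objective(phi + eps, theta, target, source, lr_inner)
    down = meta_objective(phi - eps, theta, target, source, lr_inner)
    grad_phi = (up - down) / (2.0 * eps)  # central finite difference
    return phi - lr_outer * grad_phi

phi, theta = 0.5, 1.0
target = (1.0, 2.0)   # (x_t, y_t), made-up target sample
source = (1.0, 1.0)   # (x_s, y_s), made-up source sample fitted by theta

before = meta_objective(phi, theta, target, source, 0.1)
for _ in range(50):
    phi = embedder_step(phi, theta, target, source)
after = meta_objective(phi, theta, target, source, 0.1)
```

In this toy setting the outer objective measures only how much a target finetuning step would damage the source fit, so `phi` drifts toward a degenerate value that suppresses the target signal. The sketch illustrates the gradient flow through the virtual update, not the full training objective.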

Paper Structure (25 sections, 12 equations, 6 figures, 11 tables, 2 algorithms)

Introduction
Problem Formulation and Analysis
Introduction to basic notations and architecture
Distortion of learned source modality knowledge
Modality semantic knowledge discrepancy
Modality Knowledge Alignment
Embedder Warmup
Learning to Align Modality Knowledge
Experiments
Results on NAS-Bench-360
Results on PDEBench
Results on Several Other Tasks
Analytical Experiments
Related Work
Cross-domain Transfer
...and 10 more sections

Figures (6)

Figure 1: Comparison between in-modality finetuning (a)(b) and cross-modal finetuning (c). Both unimodal and multimodal finetuning are considered to be in-modality finetuning because the target modality is in the scope of the pretrained model's modality. In contrast, cross-modal finetuning exploits the pretrained model on target modalities that the pretrained model is not trained on.
Figure 2: t-SNE visualization showing that the gaps between different modalities are not the same. The middle panel depicts the embeddings of CIFAR-10 generated by an ImageNet-pretrained Swin Transformer. The other four panels show the embeddings of CIFAR-10 generated by the same model after it has been finetuned on different modalities. None of these models is trained on CIFAR-10 directly. Nevertheless, finetuning on CIFAR-100 and Spherical improves the visual representation of the pretrained model, while finetuning on NinaPro and FSD50K distorts it. Davies-Bouldin indexes are shown in the upper-right corner of each panel. A smaller index means better clustering.
Figure 3: Linear probing results on CIFAR-10 using representations extracted by vision encoders finetuned on four different modalities with different finetuning methods. "Pretrained" refers to the baseline that directly uses the pretrained vision encoder.
Figure 4: Modality knowledge discrepancy between the image modality and four target modalities. The discrepancy is computed with an approximation.
Figure 5: The framework of our proposed method. (a) illustrates a single update step of the embedder $\boldsymbol{\phi}_e$ in the first stage of MoNA. The target data is first forward propagated to compute the inner-loop loss $\mathcal{L}_{inner}$, and the gradient is backpropagated to virtually update the full target model. Then, the updated encoder $\boldsymbol{\theta}^{\mathcal{T}^*}_f$ receives source data embeddings from the pretrained source embedder $\boldsymbol{\theta}_{e_0}^{\mathcal{S}}$, and the outer-loop loss is computed using source features. Finally, the outer-loop gradient is used to update the embedder, while the virtually updated model is discarded. (b) illustrates the bi-level optimization, where the outer loop updates $\boldsymbol{\phi}_e$ according to the inner-loop results.
...and 1 more figures
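
The Figure 2 caption scores each embedding with the Davies-Bouldin index, where a smaller value means better clustering. As a minimal sketch of the standard definition (average, over clusters, of the worst-case ratio of summed within-cluster scatter to centroid distance), the following pure-Python function may help. The function name `davies_bouldin` and the toy points are hypothetical, and the sketch assumes at least two clusters with distinct centroids; it is not taken from the paper's code.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index for labelled points (lower is better).
    Assumes at least two clusters with distinct centroids."""
    # Group points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Centroid of each cluster (coordinate-wise mean).
    centroids = {
        label: [sum(coord) / len(members) for coord in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter: mean Euclidean distance of members to their centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    keys = list(clusters)
    total = 0.0
    for i in keys:
        # Worst-case similarity between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys if j != i
        )
    return total / len(keys)

labels = [0, 0, 1, 1]
tight = davies_bouldin([(0, 0), (0, 1), (10, 0), (10, 1)], labels)
loose = davies_bouldin([(0, 0), (0, 4), (2, 0), (2, 4)], labels)
```

Here the well-separated clusters in `tight` yield a smaller index than the overlapping ones in `loose`, matching the caption's reading that a lower index indicates cleaner class structure.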

Theorems & Definitions (1)

Definition 2.1
