Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View
Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Kaicheng Yu, Wanyu Chen, Miaoyu Wang, Stan Z. Li
TL;DR
The paper tackles the scarcity of aligned multimodal data in specialized domains by reframing semi-supervised multimodal alignment as a manifold-matching problem. It introduces Set-CLIP, a two-stream CLIP-based framework that combines a coarse-grained MK-MMD term, a fine-grained semantic density distribution (SDD) loss, and a self-supervised distribution-stability objective to extract implicit semantic alignment from large pools of unmatched data while still using a limited set of matched pairs. In experiments on protein representation, remote sensing, and vision-language tasks, Set-CLIP improves zero-shot and retrieval results and even surpasses fully supervised baselines on some protein tasks. The work offers a scalable, domain-agnostic approach to multimodal alignment that depends less on strictly paired data, potentially widening the applicability of multimodal models in specialized fields.
Abstract
Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performance. However, in many specialized fields it is difficult to obtain sufficient aligned data for training, which seriously limits the use of previously effective models. Semi-supervised learning approaches have therefore been explored to facilitate multimodal alignment by learning from low-alignment data with few matched pairs, but traditional techniques such as pseudo-labeling may run into trouble in label-deficient scenarios. To tackle these challenges, we reframe semi-supervised multimodal alignment as a manifold matching problem and propose a new method based on CLIP, termed Set-CLIP. Specifically, by designing a novel semantic density distribution loss, we constrain the latent representation distribution at fine granularity and extract implicit semantic alignment from unpaired multimodal data, thereby reducing the reliance on large numbers of strictly matched pairs. Furthermore, we apply coarse-grained modality adaptation and unimodal self-supervised guidance to narrow the gaps between modality spaces and improve the stability of the representation distributions. Extensive experiments on a range of tasks in various fields, including protein analysis, remote sensing, and the general vision-language field, validate the efficacy of the proposed Set-CLIP method. Notably, even with no paired data for supervised training, Set-CLIP still performs strongly, achieving an improvement of 144.83% over CLIP.
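The TL;DR identifies the coarse-grained modality adaptation as an MK-MMD (multi-kernel maximum mean discrepancy) term. As a rough illustration of what such a term measures, the sketch below computes a biased squared-MMD estimate between two sets of unpaired embeddings, averaged over several Gaussian kernels. It is not the authors' implementation: the function names (`gaussian_kernel`, `mk_mmd2`), the bandwidth values, the equal kernel weights, and the synthetic embeddings are all assumptions made for the example.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel matrix between the rows of x and the rows of y."""
    # Pairwise squared Euclidean distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2ab.
    sq_dist = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negatives caused by rounding
    return np.exp(-sq_dist / (2.0 * bandwidth**2))


def mk_mmd2(x, y, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Biased squared-MMD estimate, averaged over equally weighted Gaussian kernels.

    The bandwidths and equal weights are illustrative assumptions, not values
    taken from the paper.
    """
    total = 0.0
    for bw in bandwidths:
        k_xx = gaussian_kernel(x, x, bw).mean()
        k_yy = gaussian_kernel(y, y, bw).mean()
        k_xy = gaussian_kernel(x, y, bw).mean()
        total += k_xx + k_yy - 2.0 * k_xy
    return total / len(bandwidths)


# Synthetic stand-ins for unpaired embeddings from two modality encoders.
rng = np.random.default_rng(0)
image_emb = rng.normal(0.0, 1.0, size=(256, 16))
text_emb_near = rng.normal(0.0, 1.0, size=(256, 16))  # same distribution
text_emb_far = rng.normal(1.5, 1.0, size=(256, 16))   # shifted distribution

mmd_near = mk_mmd2(image_emb, text_emb_near)
mmd_far = mk_mmd2(image_emb, text_emb_far)
```

The estimate needs no correspondence between the rows of the two sets, which is why a term of this kind can be applied to unmatched data. In this synthetic setup, `mmd_far` comes out larger than `mmd_near` because the shifted embeddings occupy a different region of the latent space.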
