
Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View

Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Kaicheng Yu, Wanyu Chen, Miaoyu Wang, Stan Z. Li

TL;DR

The paper tackles the scarcity of aligned multimodal data in specialized domains by reframing semi-supervised multimodal alignment as a manifold-matching problem. It introduces Gentle-CLIP, a two-stream CLIP-based framework that leverages a coarse-grained MK-MMD term, a fine-grained semantic density distribution loss (SDD), and self-supervised distribution stability to extract implicit semantic alignment from large pools of unmatched data, while still utilizing a limited set of matched pairs. Through extensive experiments in protein representation, remote sensing, and vision-language tasks, Gentle-CLIP demonstrates strong performance gains, including notable improvements in zero-shot and retrieval benchmarks and even surpassing full supervision baselines in some protein tasks. The work offers a scalable, domain-agnostic approach to multimodal alignment with reduced dependence on strictly paired data, potentially expanding the applicability of multimodal models across specialized fields.

Abstract

Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performance. However, in many specialized fields it is difficult to obtain sufficient aligned data for training, which seriously limits the use of previously effective models. Semi-supervised learning approaches therefore attempt to facilitate multimodal alignment by learning from low-alignment data with fewer matched pairs, but traditional techniques such as pseudo-labeling may run into trouble in label-deficient scenarios. To tackle these challenges, we reframe semi-supervised multimodal alignment as a manifold matching problem and propose a new CLIP-based method, termed Set-CLIP. Specifically, by designing a novel semantic density distribution loss, we constrain the latent representation distribution at fine granularity and extract implicit semantic alignment from unpaired multimodal data, thereby reducing the reliance on numerous strictly matched pairs. Furthermore, we apply coarse-grained modality adaptation and unimodal self-supervised guidance to narrow the gaps between modality spaces and improve the stability of representation distributions. Extensive experiments on a range of tasks in various fields, including protein analysis, remote sensing, and the general vision-language field, validate the efficacy of the proposed Set-CLIP method. Notably, even with no paired data for supervised training, Set-CLIP still performs strongly, bringing an improvement of 144.83% over CLIP.
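
The TL;DR names MK-MMD as the coarse-grained term that narrows the gap between modality spaces. As a rough illustration of what such a term measures, the sketch below computes a biased estimate of the squared maximum mean discrepancy between two unpaired sets of embeddings under a mixture of Gaussian kernels. It is not the authors' implementation: the function names, the equal kernel weights, and the bandwidth values are all assumptions made for this example.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distance between every row of a and every row of b."""
    a2 = np.sum(a * a, axis=1)[:, None]
    b2 = np.sum(b * b, axis=1)[None, :]
    # Clip tiny negative values caused by floating-point cancellation.
    return np.maximum(a2 + b2 - 2.0 * (a @ b.T), 0.0)


def mk_mmd2(x, y, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Biased estimate of squared MMD between sample sets x and y.

    The kernel is an equally weighted average of Gaussian kernels.
    Both the bandwidths and the equal weighting are illustrative
    assumptions, not values taken from the paper.
    """
    def kernel(a, b):
        d = pairwise_sq_dists(a, b)
        return sum(np.exp(-d / (2.0 * s ** 2)) for s in bandwidths) / len(bandwidths)

    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(64, 8))        # stand-in for one modality
    text_emb = rng.normal(size=(64, 8)) + 2.0   # shifted: a "modality gap"
    aligned_emb = rng.normal(size=(64, 8))      # same distribution as image_emb
    print("gap present :", mk_mmd2(image_emb, text_emb))
    print("gap absent  :", mk_mmd2(image_emb, aligned_emb))
```

Because the estimate only compares the two sets as distributions, it needs no matched pairs, which is why a term of this kind can be applied to unpaired data. In a training setting the same quantity would be written in a differentiable framework and minimized alongside the other losses.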

Paper Structure (16 sections, 13 equations, 4 figures, 3 tables)

This paper contains 16 sections, 13 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison of Gentle-CLIP with CLIP and S-CLIP in how they use unpaired multimodal data. (a) CLIP is limited to matched data for multimodal fusion and may entirely lose the useful information in unlabeled data. (b) S-CLIP attempts to improve alignment performance with two novel pseudo-labeling methods, but it is limited to the language modality and relies heavily on how similarity between samples is measured. (c) Gentle-CLIP tries to explore latent alignment from unmatched multimodal data based on the characteristics of the data themselves, which requires less expert knowledge and can extend to various fields.
  • Figure 2: The overall framework of Gentle-CLIP. Building on CLIP, we design new loss functions that operate on the latent space to generate robust representations by exploring the potential alignment information in unmatched data.
  • Figure 3: Benchmark results on the image-text matching task. In these figures, circles denote image representations in the latent space, while triangles denote the latent embeddings of text. Points with the same color have similar semantics, so the goal of this task is to bring circles and triangles of the same color closer together while pushing other points apart. Gentle-CLIP obtains more discriminative representations because it applies the self-supervised contrastive loss.
  • Figure 4: The average gaps between batches and their visual display with different sample sizes.