CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features

Po-han Li, Sandeep P. Chinchali, Ufuk Topcu

TL;DR

The paper addresses the data inefficiency of CLIP-style multimodal models by introducing Canonical Similarity Analysis (CSA), which maps two unimodal feature spaces into a shared multimodal space using a CCA-based projection. CSA combines two pre-trained unimodal encoders with a weighted cosine similarity that emphasizes the first few canonical components, enabling CLIP-like capabilities without neural-network training on large multimodal datasets. The authors provide theoretical analysis of the trade-off between information preservation and noise reduction, and demonstrate that CSA can outperform or match CLIP on several tasks while using dramatically less multimodal data, with successful extension to audio and LiDAR modalities. This approach offers a practical path to scalable, data-efficient multimodal learning across diverse modality pairs.

Abstract

Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring $50,000\times$ fewer multimodal data pairs to bridge the modalities given pre-trained unimodal encoders on ImageNet classification and misinformative news caption detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features. We also demonstrate the ability of CSA with modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.
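The pipeline summarized above (a CCA-based projection of two unimodal feature spaces, followed by a weighted cosine similarity over the first few canonical components) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the regularization constant `reg`, the synthetic features standing in for encoder outputs, and the exact form of the weighting (canonical correlations as weights, truncated to the first `s` components) are all assumptions made for this example. The eigendecompositions and the SVD are the cubic-cost steps in the feature dimension.

```python
import numpy as np


def inv_sqrt(cov, eps=1e-6):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(cov)
    vals = np.maximum(vals, eps)  # guard against tiny or negative eigenvalues
    return (vecs / np.sqrt(vals)) @ vecs.T


def fit_cca(x1, x2, reg=1e-4):
    """Fit linear CCA on paired features x1 (n, d1) and x2 (n, d2).

    Returns the means, the two projection matrices, and the canonical
    correlations sorted in descending order.
    """
    n = x1.shape[0]
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    c1, c2 = x1 - mu1, x2 - mu2
    # Regularized covariances (reg is an assumption of this sketch).
    cov11 = c1.T @ c1 / (n - 1) + reg * np.eye(x1.shape[1])
    cov22 = c2.T @ c2 / (n - 1) + reg * np.eye(x2.shape[1])
    cov12 = c1.T @ c2 / (n - 1)
    w1, w2 = inv_sqrt(cov11), inv_sqrt(cov22)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    u, rho, vt = np.linalg.svd(w1 @ cov12 @ w2, full_matrices=False)
    return mu1, mu2, w1 @ u, w2 @ vt.T, rho


def weighted_cosine(z1, z2, rho, s):
    """Cosine-style similarity over the first s canonical components,
    with each component weighted by its canonical correlation."""
    z1, z2, w = z1[..., :s], z2[..., :s], rho[:s]
    num = np.sum(w * z1 * z2, axis=-1)
    den = np.linalg.norm(z1, axis=-1) * np.linalg.norm(z2, axis=-1)
    return num / np.maximum(den, 1e-12)


def demo(seed=0, n=2000, k=4, d1=16, d2=12, s=4):
    """Synthetic stand-in for two unimodal encoders sharing a latent signal."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n, k))
    x1 = latent @ rng.normal(size=(k, d1)) + 0.1 * rng.normal(size=(n, d1))
    x2 = latent @ rng.normal(size=(k, d2)) + 0.1 * rng.normal(size=(n, d2))
    split = n * 3 // 4
    mu1, mu2, a, b, rho = fit_cca(x1[:split], x2[:split])
    # Project held-out features into the shared canonical space.
    z1 = (x1[split:] - mu1) @ a
    z2 = (x2[split:] - mu2) @ b
    matched = weighted_cosine(z1, z2, rho, s).mean()
    shuffled = weighted_cosine(z1, z2[rng.permutation(len(z2))], rho, s).mean()
    return matched, shuffled, rho


matched, shuffled, rho = demo()
print(f"matched pairs: {matched:.3f}, shuffled pairs: {shuffled:.3f}")
```

On this synthetic data, correctly paired held-out samples should score much higher than randomly re-paired ones. With real encoders, `x1` and `x2` would be replaced by precomputed unimodal features, and `s` would be chosen to balance information preservation against noise, the trade-off the paper analyzes.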

Paper Structure

This paper contains 20 sections, 11 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Canonical similarity analysis (CSA) replicates the CLIP multimodal similarity scores with two unimodal encoders. CSA uses two unimodal encoders to encode data to unimodal features. Then, it projects the unimodal features to a joint multimodal feature space. The weighted cosine similarity in this feature space replicates the CLIP similarity, enabling various downstream tasks, e.g., cross-modal retrieval. CSA is a robust and data-efficient method that learns even when data are misaligned.
  • Figure 2: Trade-off of canonical similarity analysis: (a) When $s$ is low, the signal-to-noise ratio (SNR) of the data is high, and the minimum singular value $\lambda_{\mathrm{min}}$ is large. It preserves the distance between contrastive data. (b) When $s$ is low, the p-value is large, and the original and shuffled distributions are alike, so the similarity score is meaningless. (a) shows the desirable properties when $s$ is low and (b) shows the opposite.
  • Figure 3: Image classification: CSA (blue) is highly data-efficient, requiring only $35,000$ training samples to match the performance of CLIP in ImageNet and $360$ samples in Leafy Spurge.
  • Figure 4: Detecting mislabeled ImageNet images: CSA (blue) outperforms CLIP, ASIF, and LLaVA with a higher AUC. (a) and (b) illustrate the results for CSA and ASIF across $2$ training set sizes, showing the superior performance of CSA with limited noisy training data.
  • Figure 5: Detecting misinformative news captions: (a) We consider the retrieved captions misinformative if they do not align with the images and the corresponding captions. (b) CSA (blue) outperforms CLIP, ASIF, and LLaVA with a higher AUC. The orange cross is the supervised learning method from the original COSMOS paper trained with labels of object locations, which is the only method that outperforms CSA.
  • ...and 10 more figures