CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features
Po-han Li, Sandeep P. Chinchali, Ufuk Topcu
TL;DR
The paper addresses the data inefficiency of CLIP-style multimodal models by introducing Canonical Similarity Analysis (CSA), which maps two unimodal feature spaces into a shared multimodal space using a CCA-based projection. CSA combines two pre-trained unimodal encoders with a weighted cosine similarity that emphasizes the first few canonical components, enabling CLIP-like capabilities without training a neural network on large multimodal datasets. The authors analyze the trade-off between information preservation and noise reduction theoretically, and show that CSA matches or outperforms CLIP on several tasks while requiring $50,000\times$ fewer multimodal data pairs. They also extend CSA to audio and lidar modalities. This approach offers a practical path to scalable, data-efficient multimodal learning across diverse modality pairs.
Abstract
Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA involves only the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. On ImageNet classification and misinformative news caption detection, experiments show that CSA outperforms CLIP while requiring $50,000\times$ fewer multimodal data pairs to bridge the modalities, given pre-trained unimodal encoders. CSA also surpasses the state-of-the-art method for mapping unimodal features to multimodal features. Finally, we demonstrate CSA on modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.
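As a rough illustration of the pipeline described above, the following NumPy sketch fits a regularized CCA on paired unimodal features and scores new pairs with a weighted cosine similarity over the first s canonical components. The function names (`fit_cca`, `weighted_cosine`), the ridge term `reg`, the synthetic features standing in for encoder outputs, and the use of canonical correlations as the weights are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def fit_cca(X, Y, reg=1e-6):
    """Fit CCA on paired features X (n, dx) and Y (n, dy).

    Returns (mean_x, mean_y, A, B, rho): projections A, B into the
    canonical space and canonical correlations rho, sorted descending.
    `reg` is a small ridge term (an assumption) for numerical stability.
    """
    n = X.shape[0]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(C)
        return (V * w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # The SVD of the whitened cross-covariance yields the canonical directions.
    U, rho, Vt = np.linalg.svd(Wx @ Cxy @ Wy, full_matrices=False)
    return mx, my, Wx @ U, Wy @ Vt.T, rho

def weighted_cosine(x, y, model, s):
    """Cosine-style similarity over the first s canonical components,
    weighted by the canonical correlations (an assumed weighting)."""
    mx, my, A, B, rho = model
    u = (x - mx) @ A[:, :s]
    v = (y - my) @ B[:, :s]
    num = np.sum(rho[:s] * u * v, axis=-1)
    den = np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1)
    return num / den

# Synthetic stand-ins for the outputs of two unimodal encoders:
# both views share a 3-dimensional latent signal plus independent noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
img_feats = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(500, 8))
txt_feats = latent @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(500, 6))

model = fit_cca(img_feats, txt_feats)
paired = weighted_cosine(img_feats, txt_feats, model, s=3)
mismatched = weighted_cosine(img_feats, np.roll(txt_feats, 1, axis=0), model, s=3)
print("mean similarity, paired:    ", paired.mean())
print("mean similarity, mismatched:", mismatched.mean())
```

The only expensive steps here are the eigendecompositions and the SVD, which is consistent with the cubic-complexity matrix decomposition mentioned in the abstract. Truncating to the first s components is where the trade-off between preserving information and suppressing noise would show up in practice.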
