Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu
TL;DR
This work provides a formal theory for how CLIP learns transferable, cross-modal representations and enables zero-shot transfer. It identifies key challenges in aligning image and text features and introduces an $(\alpha,\beta,\gamma)$-completeness framework under which near-optimal training yields separable, aligned representations across modalities; it also contrasts CLIP's contrastive loss with square loss, showing the latter's failure for zero-shot tasks. Building on these insights, the authors propose a one-sided regularization term that increases the learned margin and empirically improves zero-shot transfer and linear probing on benchmark datasets. The combination of theory and experiments yields practical guidance for improving multimodal transfer learning with small batches and temperature-aware training, while outlining limitations and avenues for extending to more modalities.
Abstract
Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve model performance. Recently, CLIP has emerged as an effective approach that employs vision-language contrastive pretraining to learn joint image and text representations and exhibits remarkable performance in zero-shot learning and text-guided natural image generation. Despite the huge practical success of CLIP, its theoretical understanding remains elusive. In this paper, we formally study the transferable representation learning underlying CLIP and demonstrate how features from different modalities become aligned. We also analyze its zero-shot transfer performance on downstream tasks. Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
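To make the objects discussed above concrete, the following is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss with a temperature, together with zero-shot prediction by nearest class-prompt embedding. It follows the commonly described form of CLIP training and is not the regularized variant proposed in the paper. The function names, the temperature value of 0.07, and the toy embeddings are illustrative assumptions, not taken from the source.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to form a
    positive pair; all other pairings in the batch act as negatives.
    The temperature default is an illustrative assumption.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should put its mass on the diagonal entry.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def zero_shot_predict(image_emb, class_text_emb):
    """Assign each image to the class whose prompt embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)


if __name__ == "__main__":
    # Toy embeddings (assumed for illustration): four orthogonal pairs.
    emb = np.eye(4)
    print("aligned loss: ", clip_contrastive_loss(emb, emb))
    print("shuffled loss:", clip_contrastive_loss(emb, emb[[1, 2, 3, 0]]))
    print("zero-shot labels:", zero_shot_predict(emb, emb))
```

In this toy setting, correctly paired embeddings yield a much smaller loss than mismatched ones, and zero-shot prediction reduces to picking the class prompt with the highest cosine similarity.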
