Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang
TL;DR
This work addresses CLIP's myopic bias by moving from one-to-one image-text alignment to a holistic paradigm that uses (image, multi-texts) data and multi-to-multi contrastive learning. It introduces two low-parameter image-encoder designs (CLS-token and parallel MLP) that produce multiple image embeddings, together with a loss framework that aligns each image-text pair across multiple perspectives, formalized as $\mathcal{L}_{\text{M2M}} = (\mathcal{L}^{\mathrm{T2I}}_{\text{M2M}} + \mathcal{L}^{\mathrm{I2T}}_{\text{M2M}})/2$. The method uses four prompts to diversify captions and relies on captioning VLMs to generate the multi-text data, improving coverage of visual semantics. Across more than ten benchmarks, Holistic CLIP consistently outperforms the myopic CLIP on image-text retrieval, open-vocabulary classification, and dense visual tasks, with the gains in interpretability and generalization attributed to diverse data and part-to-part alignment. The approach is relevant to building robust, configurable vision-language models capable of long-context understanding and fine-grained visual reasoning.
Abstract
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, because it relies on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale, noisy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm by updating both the data, which becomes more diverse, and the alignment optimization. To obtain diverse data at low cost, we use image-to-text captioning to generate multiple texts for each image from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into a multi-branch design and propose multi-to-multi contrastive optimization for image-text part-to-part matching. As a result, diverse visual embeddings are learned for each image, bringing good interpretability and generalization. Extensive experiments and ablations across more than ten benchmarks indicate that our holistic CLIP significantly outperforms existing myopic CLIP on image-text retrieval, open-vocabulary classification, and dense visual tasks.
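To make the multi-to-multi objective more concrete, the following is a minimal NumPy sketch of one way the symmetric loss $\mathcal{L}_{\text{M2M}} = (\mathcal{L}^{\mathrm{T2I}}_{\text{M2M}} + \mathcal{L}^{\mathrm{I2T}}_{\text{M2M}})/2$ could be computed. It is an illustration under stated assumptions, not the authors' implementation: it assumes each image yields M embeddings, that the k-th image embedding is paired with the k-th text of the same image, that negatives come from other samples in the batch within the same branch, and that each direction is a standard InfoNCE term averaged over branches. The function names `info_nce_halves` and `m2m_loss` and the temperature value are hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def info_nce_halves(sim):
    """Return (image-to-text, text-to-image) InfoNCE terms.

    sim is a (B, B) matrix whose rows index images and columns index texts;
    matched pairs are assumed to lie on the diagonal.
    """
    i2t = -np.mean(np.diag(log_softmax(sim, axis=1)))  # each image picks its text
    t2i = -np.mean(np.diag(log_softmax(sim, axis=0)))  # each text picks its image
    return i2t, t2i


def m2m_loss(img, txt, temperature=0.07):
    """Sketch of a multi-to-multi contrastive loss.

    img, txt: arrays of shape (B, M, D) holding M image embeddings and
    M text embeddings per sample. Assumption: branch k of the image is
    aligned with text k (part-to-part matching).
    """
    # L2-normalize so that dot products are cosine similarities.
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)

    i2t_terms, t2i_terms = [], []
    for k in range(img.shape[1]):
        sim = img[:, k] @ txt[:, k].T / temperature  # (B, B) for branch k
        i2t, t2i = info_nce_halves(sim)
        i2t_terms.append(i2t)
        t2i_terms.append(t2i)

    loss_i2t = float(np.mean(i2t_terms))
    loss_t2i = float(np.mean(t2i_terms))
    return (loss_t2i + loss_i2t) / 2.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(4, 3, 8))  # batch of 4, 3 branches, dim 8
    text_emb = rng.normal(size=(4, 3, 8))
    print("M2M loss on random embeddings:", m2m_loss(image_emb, text_emb))
```

In this sketch, setting M = 1 reduces the computation to the usual one-to-one CLIP-style loss, which is one way to see the multi-to-multi formulation as a generalization of the original objective.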
