Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang

TL;DR

This work addresses CLIP's myopic bias by moving from one-to-one image-text alignment to a holistic paradigm that uses (image, multi-texts) data and multi-to-multi contrastive learning. It introduces two low-parameter image-encoder designs (CLS-token and parallel MLP) that produce multiple image embeddings, together with a loss framework that aligns each image-text pair across multiple perspectives, formalized as $\mathcal{L}_{\text{M2M}} = (\mathcal{L}^{\mathrm{T2I}}_{\text{M2M}} + \mathcal{L}^{\mathrm{I2T}}_{\text{M2M}})/2$. The method uses four prompts to diversify captions and captioning VLMs to generate the multi-text data, improving coverage of visual semantics. Across more than ten benchmarks, Holistic CLIP consistently outperforms the myopic CLIP on image-text retrieval, open-vocabulary classification, and dense visual tasks, with the gains in interpretability and generalization attributed to diverse data and part-to-part alignment. The approach is of practical use for building robust, configurable vision-language models capable of long-context understanding and fine-grained visual reasoning.
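As a rough illustration of the symmetric loss $\mathcal{L}_{\text{M2M}}$ named above, the following minimal sketch averages a text-to-image and an image-to-text term. Everything beyond that average is an assumption made for illustration and is not taken from the paper: the per-branch InfoNCE form, the pairing of image branch $h$ with caption $h$ (the $H=M$ case), the temperature of 0.07, and the function names `normalize`, `info_nce`, and `m2m_loss`.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i]."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, key)) / temperature
            for key in keys
        ]
        # Numerically stable log-sum-exp over all keys in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits)
        )
        total += log_denominator - logits[i]
    return total / len(queries)


def m2m_loss(image_embs, text_embs, temperature=0.07):
    """Illustrative multi-to-multi loss.

    image_embs[b][h] is the embedding from image branch h for sample b;
    text_embs[b][h] is the embedding of caption h for sample b.
    Assumption: H == M and branch h is paired with caption h.
    """
    batch_size = len(image_embs)
    num_branches = len(image_embs[0])
    i2t = 0.0
    t2i = 0.0
    for h in range(num_branches):
        images = [normalize(image_embs[b][h]) for b in range(batch_size)]
        texts = [normalize(text_embs[b][h]) for b in range(batch_size)]
        i2t += info_nce(images, texts, temperature)
        t2i += info_nce(texts, images, temperature)
    i2t /= num_branches
    t2i /= num_branches
    # Symmetric average, matching (L_T2I + L_I2T) / 2.
    return (t2i + i2t) / 2


# Toy data: 2 samples, 2 branches, 2-dimensional embeddings.
aligned = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
shuffled = [aligned[1], aligned[0]]  # captions assigned to the wrong image

print("aligned loss: ", m2m_loss(aligned, aligned))
print("shuffled loss:", m2m_loss(aligned, shuffled))
```

In this toy setup the loss is lower when each image branch is paired with its own caption than when captions are swapped between samples, which is the behavior a contrastive objective of this kind is meant to reward.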

Abstract

In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, by relying on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale, messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm by updating both the data, to make it more diverse, and the alignment optimization. To obtain colorful data at low cost, we use image-to-text captioning to generate multi-texts for each image from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into a multi-branch design and propose multi-to-multi contrastive optimization for image-text part-to-part matching. As a result, diverse visual embeddings are learned for each image, bringing good interpretability and generalization. Extensive experiments and ablations across more than ten benchmarks indicate that our holistic CLIP significantly outperforms existing myopic CLIP on image-text retrieval, open-vocabulary classification, and dense visual tasks.

Paper Structure

This paper contains 27 sections, 7 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Myopia. OpenAI's CLIP uses crude (image, text) web data for one-to-one contrastive alignment, causing serious myopia, i.e., bias towards monotonous short texts and shallow visual expressivity. Holism. We advance a holistic CLIP paradigm by updating to colorful (image, multi-texts) data from diverse views and levels, and by designing multi-to-multi contrastive learning for image-text part-to-part matching.
  • Figure 2: Pipeline Overview of Holistic CLIP. To evolve data from monotonous (image, text) pairs to colorful (image, multi-texts) pairs, we use powerful VLMs for captioning from multiple views, levels, and granularities. Diverse prompts are defined to encourage diversity. We then modify the CLIP image encoder into a multi-branch design and optimize it with multi-to-multi contrastive learning for part-to-part matching. During inference, flexible embedding customizations are available for different tasks, showing good interpretability and generalization.
  • Figure 3: Attention Visualization of Holistic CLIP's Vision. Vision is naturally decomposed by aligning with various texts.
  • Figure 4: Architecture Overview of Holistic CLIP. To generate $H$ image features, we leverage two different structures: $\Psi_{\mathrm{CLS}}$ and $\Psi_{\mathrm{MLP}}$. Then we match $H$ image features with $M$ text features. Normally $H=M$ and we apply one-to-one matching.
  • Figure 5: Examples of (Image, Multi-Texts) Data from Multi-VLMs.
  • ...and 4 more figures