LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Chunyu Wang, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu

TL;DR

LLM2CLIP presents a two-stage post-training approach that harnesses large language models to enrich CLIP's textual supervision and cross-modal space. Stage 1 performs caption-contrastive fine-tuning on the LLM to produce more discriminative caption embeddings, while Stage 2 post-trains CLIP with the tuned LLM (via lightweight adaptors) to strengthen cross-modal alignment. Across extensive experiments, LLM2CLIP yields substantial improvements over CLIP, EVA02, and SigLIP2 on zero-shot and cross-lingual retrieval, and enhances multimodal LM pretraining, all with improved training efficiency. The work demonstrates how open-world language understanding can be leveraged to overcome CLIP's limitations with long captions and dense textual descriptions.

Abstract

CLIP is a foundational multimodal model that aligns image and text features into a shared representation space via contrastive learning on large-scale image-text pairs. Its effectiveness primarily stems from the use of natural language as rich supervision. Motivated by the remarkable advancements in large language models (LLMs), this work explores how LLMs' superior text understanding and extensive open-world knowledge can enhance CLIP's capability, especially for processing longer and more complex image captions. We propose an efficient post-training strategy that integrates LLMs into pretrained CLIP. To address the challenge posed by the autoregressive nature of LLMs, we introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs. Extensive experiments demonstrate that our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance. Furthermore, we validate substantial improvements over state-of-the-art models such as CLIP, EVA02, and SigLIP2 across various zero-shot multimodal retrieval tasks, cross-lingual retrieval tasks, and multimodal language model pretraining.
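To make the idea of caption-to-caption contrastive fine-tuning more concrete, the following is a minimal sketch of a generic InfoNCE-style objective over paired caption embeddings. It is not the authors' implementation: the function names, the temperature of 0.07, and the use of plain Python lists as embeddings are all assumptions made for illustration, and the abstract does not specify the exact loss, pooling, or batch construction.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchor i to positive i.

    anchors[i] and positives[i] are assumed to be embeddings of two
    captions describing the same image; every other positive in the
    batch acts as a negative. The temperature value is an assumption.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every positive, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


# Toy 2-D "caption embeddings" (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(anchors, positives)
shuffled_loss = info_nce(anchors, positives[::-1])
print(aligned_loss < shuffled_loss)
```

Minimizing a loss of this shape pulls embeddings of captions for the same image together and pushes other captions in the batch apart, which is the sense in which the LLM's output space becomes more discriminative.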

Paper Structure

This paper contains 21 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: LLM2CLIP Overview. After applying caption contrastive fine-tuning to the LLM, the increased textual discriminability enables more effective CLIP training. We leverage the open-world knowledge and general capabilities of the LLM to better process dense captions, addressing the previous limitations of the pretrained CLIP visual encoder and providing richer, higher-dimensional textual supervision.
  • Figure 2: By adjusting the output space of the LLM, we enable the LLM to more effectively act as a private tutor for CLIP, allowing the fine-tuning process of LLM2CLIP to efficiently adjust the pretrained CLIP's cross-modal space.
  • Figure 3: Real examples of top-1 results from the caption-to-caption retrieval experiment on the MS COCO 5K test set. Before fine-tuning, Llama3’s results were often completely unrelated.
  • Figure 4: Radar chart of multilingual results on XM3600, highlighting the differences in T2I and I2T performance of SigLIP2 before and after applying LLM2CLIP training.
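The caption-to-caption retrieval experiment mentioned in Figure 3 can be illustrated with a small top-1 accuracy routine. This is a sketch under stated assumptions rather than the paper's evaluation code: the helper names, the toy vectors, and the way query and gallery captions are paired are hypothetical, and cosine similarity is assumed as the ranking score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def top1_accuracy(queries, gallery, targets):
    """Fraction of queries whose nearest gallery item is the target.

    targets[i] is the gallery index of the caption assumed to describe
    the same image as queries[i].
    """
    hits = 0
    for q, t in zip(queries, targets):
        # Index of the gallery embedding most similar to the query.
        best = max(range(len(gallery)), key=lambda j: cosine(q, gallery[j]))
        if best == t:
            hits += 1
    return hits / len(queries)


# Toy embeddings (hypothetical values): query i should retrieve gallery i.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery = [[2.0, 0.1], [0.1, 3.0], [1.0, 1.1]]

print(top1_accuracy(queries, gallery, targets=[0, 1, 2]))
```

An embedding model whose outputs are poorly separated would rank unrelated captions first and score low on this metric, which matches the qualitative behavior the caption describes for Llama3 before fine-tuning.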