Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao

TL;DR

The paper addresses the weak performance of CLIP-style multimodal models on text-only tasks by introducing a three-stage, multi-task contrastive training regime that jointly optimizes text-image and text-text alignment. Using a dual-encoder setup (JinaBERT for text and EVA02 for images) and long-caption augmentation, the approach yields a single model, jina-clip-v1, with strong cross-modal retrieval and competitive text-embedding performance. The results indicate that one jointly trained model can approach state-of-the-art performance on multimodal tasks while matching or exceeding text-only baselines. The work covers English only, a limit the authors attribute to resource constraints. For information retrieval systems, the practical benefit is a reduced need to maintain separate text-only and multimodal models.

Abstract

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-size vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform specialized text models on text-only tasks. This creates inefficiencies for information retrieval systems, which must keep separate embeddings and models for text-only and multimodal tasks. We propose a novel multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
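
To make "multi-task contrastive training" concrete, the following is a minimal sketch of a joint objective that sums a text-text and a text-image contrastive term. It is an illustration under stated assumptions, not the authors' implementation: the symmetric InfoNCE form, the temperature of 0.07, the unweighted sum, and all function names are assumptions introduced here, and the embeddings are plain Python lists rather than encoder outputs.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries: Sequence[Vector], targets: Sequence[Vector],
             temperature: float) -> float:
    """One direction of InfoNCE: queries[i] should match targets[i];
    every other target in the batch serves as a negative."""
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_info_nce(a: Sequence[Vector], b: Sequence[Vector],
                       temperature: float = 0.07) -> float:
    """Contrast in both directions (a -> b and b -> a)."""
    return info_nce(a, b, temperature) + info_nce(b, a, temperature)


def joint_loss(text_a: Sequence[Vector], text_b: Sequence[Vector],
               captions: Sequence[Vector], images: Sequence[Vector],
               temperature: float = 0.07) -> float:
    """Hypothetical multi-task objective: unweighted sum of a text-text term
    and a text-image term (the equal weighting is an assumption)."""
    return (symmetric_info_nce(text_a, text_b, temperature)
            + symmetric_info_nce(captions, images, temperature))
```

In this sketch, `text_a`/`text_b` would hold embeddings of paired texts and `captions`/`images` embeddings of paired captions and images, with the same text encoder producing `text_a`, `text_b`, and `captions`. Sharing that encoder across both terms is what lets a single model serve text-only and cross-modal retrieval.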

Paper Structure

This paper contains 10 sections, 3 equations, 1 figure, and 12 tables.

Figures (1)

  • Figure 1: The training paradigm of jina-clip-v1 (https://huggingface.co/jinaai/jina-clip-v1), jointly optimizing text-image and text-text matching.