jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Mohammad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, Han Xiao

TL;DR

jina-clip-v2 addresses the limitations of English-centric CLIP models with a multilingual, multimodal embedding model that supports both text-only and crossmodal tasks. It uses a multi-task, multi-stage contrastive learning framework that pairs a multilingual text encoder with an image encoder, trained on text from 29 non-English languages and on images of visually rich documents. It also applies Matryoshka Representation Learning, so embeddings can be truncated from $1024$ to as few as $256$ dimensions with minimal loss. The model achieves strong crossmodal and text retrieval performance in English and multilingual settings, and delivers stronger understanding of visually rich documents on the ViDoRe benchmark. The work also offers practical insights into image-resolution choices and the modality gap in CLIP-like systems, and releases jina-clip-v2 publicly for broader use and benchmarking.

Abstract

Contrastive Language-Image Pretraining (CLIP) has been widely used for crossmodal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for crossmodal vision-language tasks and underperform in single-mode text tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. Additionally, from a visual understanding perspective, previous CLIP-based models exhibit insufficient understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets and image-text pairs via a multi-task and multi-stage contrastive learning paradigm in order to support both text-only and crossmodal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and crossmodal retrieval tasks in both English and multilingual settings. jina-clip-v2 also provides for flexibility in embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at https://huggingface.co/jinaai/jina-clip-v2.
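The flexible embedding dimensionality described above can be illustrated with a minimal sketch. It assumes that truncation means keeping the leading components of a 1024-dimensional vector, and it re-normalizes afterwards so that dot products remain cosine similarities; the re-normalization step and the helper name `truncate_embedding` are assumptions for illustration, not part of the released model's API. Random vectors stand in for real model outputs.

```python
import numpy as np


def truncate_embedding(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding and re-normalize.

    Hypothetical helper: with Matryoshka-style training the leading
    components are expected to carry most of the information, so a prefix
    of the vector can be used as a smaller embedding.
    """
    if not 0 < dim <= embeddings.shape[-1]:
        raise ValueError("dim must be between 1 and the full embedding size")
    truncated = embeddings[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    # Guard against division by zero for degenerate all-zero vectors.
    return truncated / np.clip(norms, 1e-12, None)


# Random stand-ins for four 1024-dimensional embeddings (not model outputs).
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))

small = truncate_embedding(full, 256)   # shape (4, 256), unit-length rows
similarity = small @ small.T            # cosine similarities after truncation
```

Smaller vectors reduce storage and search cost; how much retrieval quality is retained at each size is an empirical question that the paper's evaluation addresses.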

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 26 tables.

Figures (3)

  • Figure 1: jina-clip-v2 (https://huggingface.co/jinaai/jina-clip-v2) combines a text encoder (Jina XLM-RoBERTa, 561M parameters) and a vision encoder (EVA02-L14, 304M parameters) for a total of 865M parameters.
  • Figure 2: Performance on the ViDoRe benchmark [colpali] against input resolution.
  • Figure 3: Contrastive learning between two embedding groups (Unified Batch technique). The first group concatenates original images, question texts, and one view of augmented images, while the second group concatenates corresponding image captions, answer texts, and a different view of the augmented images.
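
The grouping described in the Figure 3 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's training code: the loss is taken to be a symmetric InfoNCE objective, the temperature of 0.07 is a placeholder, the batch sizes are arbitrary, and random vectors stand in for encoder outputs. The only detail drawn from the caption is which items are concatenated into each group.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(group_a, group_b, temperature=0.07):
    """Symmetric InfoNCE: row i of group_a is the positive for row i of group_b.

    The loss form and the temperature value are assumptions for illustration.
    """
    a, b = l2_normalize(group_a), l2_normalize(group_b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


# Random stand-ins for encoder outputs; sizes are arbitrary.
rng = np.random.default_rng(0)
dim = 32
images, captions = rng.normal(size=(4, dim)), rng.normal(size=(4, dim))
questions, answers = rng.normal(size=(3, dim)), rng.normal(size=(3, dim))
view_1, view_2 = rng.normal(size=(2, dim)), rng.normal(size=(2, dim))

# First group: original images, question texts, one augmented view.
group_1 = np.concatenate([images, questions, view_1])
# Second group: matching captions, answer texts, a different augmented view.
group_2 = np.concatenate([captions, answers, view_2])

loss = symmetric_info_nce(group_1, group_2)
# Sanity check: perfectly aligned groups should give a much lower loss.
aligned_loss = symmetric_info_nce(group_1, group_1)
```

Concatenating the three pair types into one batch means every item also serves as an in-batch negative for items of the other types, which is one plausible reading of the "Unified Batch" label in the caption.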