jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao

TL;DR

Jina-embeddings-v4 introduces a unified 3.8B multimodal encoder that processes text and images in a single pathway, producing both single-vector and multi-vector embeddings. It employs three task-specific LoRA adapters to specialize for retrieval, semantic similarity, and code retrieval, while keeping the backbone frozen to enable efficient adaptation. The work also presents Jina-VDR, a broad multilingual benchmark for visually rich document retrieval, and demonstrates strong, often state-of-the-art, performance across multilingual text retrieval, semantic similarity, multimodal retrieval, and code tasks. By reducing the modality gap and enabling cross-modal alignment within one model, it offers a practical, scalable solution for diverse retrieval scenarios and sets a foundation for further multilingual and efficiency-focused enhancements.

Abstract

We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-document retrieval, semantic text similarity, and code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
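As a rough illustration of what "multi-vector embeddings in the late interaction style" can mean in practice, the sketch below scores a query against a document with a ColBERT-style MaxSim rule: each query token vector is matched to its most similar document token vector, and the per-token maxima are summed. The abstract does not spell out the scoring function, so the MaxSim rule, the function names, and the random 128-dimensional token vectors here are all assumptions made for illustration, not the model's actual API.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def late_interaction_score(query_vecs, doc_vecs):
    """Hypothetical MaxSim score: sum over query tokens of the best-matching doc token."""
    q = l2_normalize(np.asarray(query_vecs, dtype=np.float64))
    d = l2_normalize(np.asarray(doc_vecs, dtype=np.float64))
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())


# Toy data (assumption): random vectors stand in for real token embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 128))
doc_a = np.vstack([query, rng.normal(size=(6, 128))])  # contains every query token
doc_b = rng.normal(size=(10, 128))                     # unrelated tokens

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
print(score_a > score_b)
```

Because `doc_a` contains an exact copy of every query token, each per-token maximum is a cosine similarity of 1, so its score equals the number of query tokens; the unrelated `doc_b` scores lower.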

Paper Structure

This paper contains 34 sections, 8 equations, 3 figures, and 14 tables.

Figures (3)

  • Figure 1: Architecture of jina-embeddings-v4. The model employs a unified LM built on the Qwen2.5-VL-3B-Instruct backbone (3.8B parameters). Text and image inputs are processed through a shared pathway: images are first converted to token sequences via a vision encoder, then both modalities are jointly processed by the language model decoder with contextual attention layers. Three task-specific LoRA adapters (60M parameters each) provide specialized optimization for retrieval, text-matching, and code search tasks without modifying the frozen backbone weights. The architecture supports dual output modes: (1) single-vector embeddings (2048 dimensions, truncatable to 128) generated via mean pooling for efficient similarity search, and (2) multi-vector embeddings (128 dimensions per token) generated via projection layers for late-interaction-style retrieval.
  • Figure 2: Distribution of the cosine similarities of paired image-text embeddings versus paired text-text embeddings from the Flickr8K dataset. (top) OpenAI CLIP, (middle) jina-clip-v2, (bottom) jina-embeddings-v4.
  • Figure 3: Distribution of the cosine similarities of positive (correct matches) versus negative (incorrect matches) image-text samples. (top) OpenAI CLIP, (middle) jina-clip-v2, (bottom) jina-embeddings-v4.
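The single-vector output mode described in the Figure 1 caption (mean pooling to 2048 dimensions, truncatable to 128) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, random numbers stand in for the decoder's token states, and truncation is assumed to keep the leading dimensions and re-normalize, which is the usual practice for truncatable embeddings but is not confirmed by the caption.

```python
import numpy as np


def mean_pool(token_states, attention_mask):
    """Average the token states, ignoring padded positions (mask value 0)."""
    states = np.asarray(token_states, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[:, None]
    summed = (states * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)  # guard against an all-padding input
    return summed / count


def truncate_and_normalize(embedding, dim):
    """Assumption: keep the leading `dim` values, then rescale to unit length."""
    head = embedding[:dim]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head


# Toy data (assumption): 5 token positions, the last 2 are padding.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 2048))
mask = np.array([1, 1, 1, 0, 0])

full = mean_pool(hidden, mask)             # 2048-dimensional single vector
small = truncate_and_normalize(full, 128)  # truncated to 128 dimensions
print(full.shape, small.shape)
```

Re-normalizing after truncation keeps dot products between truncated vectors interpretable as cosine similarities, which is what a similarity search over the shortened embeddings would typically rely on.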