Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
Michael Günther, Louis Milliken, Jonathan Geuter, Georgios Mastrapas, Bo Wang, Han Xiao
TL;DR
Jina Embeddings is a set of high-performance sentence embedding models trained with large-scale contrastive learning on encoder-only T5 models. The authors build a two-phase training pipeline on a broad, multi-task dataset formulated as pairs and triplets, together with a dedicated negation dataset that improves sensitivity to negation. Rigorous data filtering (deduplication, language filtering, and consistency filtering) cuts the training data from billions of items to hundreds of millions while still achieving competitive results on the Massive Text Embedding Benchmark (MTEB) across retrieval, similarity, and reranking tasks. The work highlights the practical benefits of data quality and task-focused training, and points toward more efficient, transferable embeddings, with possible extensions to bilingual and longer-sequence applications.
Abstract
Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the model's awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.
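To make the idea of contrastive training on text pairs more concrete, the following is a minimal sketch of one commonly used objective, InfoNCE with in-batch negatives. The summary above does not state which loss the authors use, so the choice of loss, the function names (`cosine`, `info_nce_loss`), and the temperature value are all illustrative assumptions, and the two-dimensional vectors are toy data rather than real model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, target) embedding pairs.

    targets[i] is treated as the positive for queries[i]; every other
    target in the batch acts as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all targets in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings: each query points in the same direction as its positive.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_targets = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = info_nce_loss(toy_queries, toy_targets)
# Swapping the targets pairs each query with the wrong positive.
misaligned_loss = info_nce_loss(toy_queries, toy_targets[::-1])
```

In this toy setup the loss is close to zero when each query is most similar to its own target and grows large when the pairing is swapped, which is the pressure that pushes matching texts together and non-matching texts apart in embedding space.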
