KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang

TL;DR

The paper addresses the limitation that many text embedding models rely primarily on data scale rather than on training techniques and data quality. It introduces KaLM-Embedding-V2, a compact 0.5B embedding series that removes the causal mask to enable bidirectional learning, uses a multi-stage training pipeline (pre-training, fine-tuning, and contrastive distillation from a stronger teacher), and applies focal-style reweighting together with online hard-negative mixing. Data curation is extensive (around 470M pre-training samples across 20 categories and 6M fine-tuning/distillation samples across 100 categories), covering retrieval and non-retrieval tasks with task-specific instructions and hard-negative mining. Evaluated on the Chinese (cmn) and English (eng) MTEB benchmarks, KaLM-Embedding-V2.5 achieves state-of-the-art results among models under 1B parameters and closely rivals models 3–26× larger, demonstrating strong generalization, robustness, and practicality for retrieval and downstream tasks, while keeping the approach transparent and open-source.

Abstract

Recent advancements in Large Language Model (LLM)-based text embedding models primarily focus on data scaling or synthesis, with limited exploration of training techniques and data quality, which constrains performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models that systematically incentivizes advanced embedding capability in LLMs through superior training techniques and high-quality data. For model architecture, we build the models at a compact 0.5B size, use simple mean-pooling to produce fixed-length embeddings, and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning with supervised high-quality datasets, and contrastive distillation with fine-grained soft signals. The pipeline is integrated with focal-style reweighting, which emphasizes difficult samples, and online hard-negative mixing, which enriches hard negatives. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3–26× larger, setting a new standard for versatile and compact embedding models under 1B parameters.
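To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of mean-pooling over non-padding tokens and of an in-batch contrastive (InfoNCE) loss with a focal-style weight. It is an illustration under stated assumptions, not the authors' implementation: the weight form (1 - p_pos)^gamma is borrowed from the standard focal loss, and the function names, the temperature of 0.05, and gamma of 0.5 are hypothetical choices that do not come from the paper.

```python
import math


def mean_pool(token_embs, attention_mask):
    """Average token vectors over positions whose mask value is 1."""
    dim = len(token_embs[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embs, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / max(count, 1) for t in total]


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def focal_info_nce(queries, docs, temperature=0.05, gamma=0.5):
    """In-batch InfoNCE where docs[i] is the positive for queries[i].

    Assumption: each query's loss is scaled by (1 - p_pos) ** gamma, so
    pairs the model already ranks confidently contribute less.
    Returns (mean weighted loss, per-query weights).
    """
    queries = [l2_normalize(q) for q in queries]
    docs = [l2_normalize(d) for d in docs]
    losses, weights = [], []
    for i, q in enumerate(queries):
        # Cosine similarity to every document in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        # Numerically stable log-softmax for the positive document.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        log_p_pos = logits[i] - log_z
        p_pos = math.exp(log_p_pos)
        weights.append((1.0 - p_pos) ** gamma)
        losses.append(-log_p_pos)
    loss = sum(w * l for w, l in zip(weights, losses)) / len(losses)
    return loss, weights
```

With gamma set to 0 every weight equals 1 and the function reduces to plain in-batch InfoNCE; larger gamma values shift more of the loss toward queries whose positive document receives low probability.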

Paper Structure

This paper contains 20 sections, 9 equations, 5 figures, 16 tables.

Figures (5)

  • Figure 1: (Left) Comparison between the KaLM-Embedding series and other models on MTEB. The red dashed line depicts the logarithmic trendline fitted to the performance data of all the baseline models. (Right) Radar charts show our models achieve SOTA performance in a wide array of tasks.
  • Figure 2: The overall training workflow of the KaLM-Embedding-V2 series. The left illustrates the workflow of contrastive learning, while the right shows that of contrastive distillation.
  • Figure 3: Multi-stage training pipeline of the KaLM-Embedding-V2 series.
  • Figure 4: Comparison of discriminative capacity between positive and hard negatives. Cases are randomly sampled from the HotpotQA dataset, where the task instruction is "Instruct: Given a query, retrieve documents that answer the query Query: {query}".
  • Figure 5: Embedding distribution comparisons between KaLM-Embedding-V1, KaLM-Embedding-V2.5, and Qwen3-Embedding-0.6B.