F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Ziyin Zhang; Zihan Liao; Hang Yu; Peng Di; Rui Wang

Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a newly curated composite of 60 million publicly available, high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, we obtain models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.

Paper Structure (13 sections, 5 figures, 9 tables)

Introduction
Related Work
F2LLM-v2
Training Data
Model Architecture
Two-stage Training
Pruning and Knowledge Distillation
Experiments
Main Results
Ablation Studies
Conclusion
Training Data Details
Details on MTEB Evaluation

Figures (5)

Figure 1: The top six models on ten language-specific MTEB leaderboards. The previous SOTA performance is given by the horizontal line. In each subplot title, we list the number of submissions with complete results on the corresponding benchmark. For comparison, the English benchmark has 163 complete submissions.
Figure 2: Top-100 natural languages and top-10 programming languages in our training data.
Figure 3: Comparison between the language distribution of our training data (outer circle) and that of KaLM-Embedding (inner circle). KaLM-Embedding's data is annotated with only three labels, while ours is annotated with specific languages.
Figure 4: Task type distribution in our training data.
Figure 5: Results of evaluating F2LLM-v2 models at different representation sizes.
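
As a rough illustration of what evaluating at different representation sizes can look like, the sketch below truncates an embedding to its first few dimensions and re-normalizes it before computing cosine similarity. This prefix-truncation scheme is the usual way matryoshka-trained embeddings are consumed, but whether F2LLM-v2 is evaluated in exactly this way is an assumption here. The helper names `truncate_embedding` and `cosine` and the toy vectors are hypothetical and are not taken from the paper or its released code.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 8-dimensional vectors standing in for model outputs (made-up values).
query = [0.9, 0.1, -0.3, 0.2, 0.05, -0.02, 0.01, 0.03]
doc = [0.8, 0.2, -0.25, 0.1, -0.04, 0.03, 0.02, -0.01]

# Compare the same pair at progressively smaller representation sizes.
for dim in (8, 4, 2):
    q = truncate_embedding(query, dim)
    d = truncate_embedding(doc, dim)
    print(dim, round(cosine(q, d), 4))
```

Smaller prefixes reduce storage and similarity-search cost, typically at some cost in accuracy, which is the trade-off a plot like Figure 5 would show.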
