F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

TL;DR

F2LLM is a family of open-source embedding models (0.6B, 1.7B, 4B) finetuned directly from foundation models on 6 million open-source, non-synthetic query–document–hard-negative tuples. Using single-stage contrastive finetuning with margin-based hard negative mining over a unified mix of retrieval, classification, and clustering data, the approach reaches competitive MTEB performance without expensive synthetic data or multi-stage pretraining. The authors release all model checkpoints, training data, and code. They report strong leaderboard results (F2LLM-4B ranks 2nd among models of roughly 4B parameters and 7th overall; F2LLM-1.7B ranks 1st in the 1B–2B range) along with a record clustering score, establishing a reproducible, budget-friendly baseline for future embedding research. The work suggests that carefully curated open-source data and efficient training can match SOTA-level results typically achieved with much larger, synthetic, or closed datasets, which makes embedding research more accessible.

Abstract

We introduce F2LLM (Foundation to Feature Large Language Models), a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future work.
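
To make the idea of contrastive finetuning on query-document-negative tuples more concrete, here is a minimal sketch of an InfoNCE-style loss for a single tuple. It is an illustration under stated assumptions, not the report's own objective: the function names, the use of cosine similarity, and the temperature value of 0.05 are all hypothetical choices, and the toy vectors stand in for embeddings that a model would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one (query, positive, hard negatives) tuple.

    The temperature is an assumed value chosen for illustration only.
    """
    # The positive document's logit comes first, followed by the negatives.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    # Negative log-probability of picking the positive document.
    return log_denominator - logits[0]


# Toy embeddings (hypothetical): the positive points the same way as the query.
query = [1.0, 0.0]
loss_good = hard_negative_loss(query, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Swapping the positive with a dissimilar document should raise the loss.
loss_bad = hard_negative_loss(query, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(loss_good < loss_bad)
```

The loss is small when the query is closer to its positive document than to any hard negative, and grows when a negative outranks the positive, which is the signal the finetuning uses to shape the embedding space.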

Paper Structure

This paper contains 11 sections, 4 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: (Left): MTEB performance comparison between LLM-based embedding models. (Right): F2LLM, trained solely on open-source non-synthetic data, achieves a strong balance between embedding performance, training data, and model size. Higher scores indicate better performance (left axis), less training data (right axis), and smaller model size (bottom axis).