Little Giants: Synthesizing High-Quality Embedding Data at Scale

Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou

TL;DR

This paper introduces SPEED, a framework that aligns small open-source models to efficiently generate large-scale synthetic embedding data. It also presents a comprehensive study of how various factors within the alignment pipeline affect data quality and reveals the scaling law for synthetic embedding data.

Abstract

Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples, significantly reducing the cost of human annotation. However, most current approaches rely heavily on proprietary models like GPT-4, which are expensive and inefficient for generating large-scale embedding data. In this paper, we introduce SPEED, a framework that aligns open-source small models (8B) to efficiently generate large-scale synthetic embedding data. Through supervised fine-tuning, preference optimization, and self-improvement, SPEED enables small open-source models to produce high-quality data. Remarkably, SPEED uses less than 1/10 of the GPT API calls yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data. Using this efficient generator, we conduct a comprehensive study on how various factors within the alignment pipeline impact data quality and reveal the scaling law for synthetic embedding data.

Paper Structure

This paper contains 32 sections, 4 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: An illustration comparing the existing pipeline with our data synthesis framework.
  • Figure 2: An overview of SPEED. We align small LLMs (8B) to synthesize large-scale high-quality embedding data.
  • Figure 3: Performance of SPEED (230K data for efficient testing) under different settings of the alignment pipeline.
  • Figure 4: Scaling laws for model performance in relation to synthetic embedding data size on MTEB.
  • Figure 5: An example of the preference signals generated for DPO. A data prompt and a data list are fed into GPT-4, which selects the best and worst data according to the requirements of the prompt. The data prompt template is from E5$_\text{mistral}$.
  • ...and 6 more figures