FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training

Liangyu Xu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Rongxiang Weng, Jingang Wang, Xunliang Cai

TL;DR

FIRE is a flexible framework for integrating multiple data quality raters when selecting pretraining data for large language models. It aligns diverse rating signals into a unified space via win-rate-percentile mappings and fuses them with intrinsic reliability and orthogonality using an orthogonality graph, producing an integrated quality signal $I(x) = \mathbf{A}(x)^{\top} (\mathbf{o} \odot \bm{\gamma})$. A progressive data selection scheme refines the high-quality data subset by re-evaluating orthogonality within smaller segments. Experiments on SlimPajama with 1.3B and 3B parameter models show that FIRE outperforms single-rater and baseline data selection methods, achieving up to 2.9% average gains on downstream tasks while using less than 37.5% of the training data the Random baseline needs to reach the target performance. Performance improves as more raters are integrated and when progressive selection is applied, which points to the scalability and data efficiency of the multi-rater approach.
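To make the integrated signal concrete, the following minimal sketch evaluates $I(x) = \mathbf{A}(x)^{\top} (\mathbf{o} \odot \bm{\gamma})$ for a single data point. It assumes that $\mathbf{A}(x)$ holds one aligned rating per rater, that $\mathbf{o}$ holds per-rater orthogonality, and that $\bm{\gamma}$ holds per-rater reliability; this reading of the symbols, the function name `integrated_quality`, and all numeric values are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def integrated_quality(aligned: Sequence[float],
                       orthogonality: Sequence[float],
                       reliability: Sequence[float]) -> float:
    """Compute I(x) = A(x)^T (o * gamma) for one data point.

    All three inputs hold one entry per rater (assumed interpretation).
    """
    if not (len(aligned) == len(orthogonality) == len(reliability)):
        raise ValueError("all three vectors need one entry per rater")
    # Element-wise product o * gamma gives one weight per rater.
    weights = [o * g for o, g in zip(orthogonality, reliability)]
    # Dot product of the aligned ratings with the per-rater weights.
    return sum(a * w for a, w in zip(aligned, weights))


# Hypothetical values for three raters (not from the paper).
aligned = [0.9, 0.4, 0.7]
orthogonality = [0.5, 0.3, 0.2]
reliability = [1.0, 0.5, 0.5]
score = integrated_quality(aligned, orthogonality, reliability)
```

With these made-up inputs, a rater contributes more to the score when it is both reliable and weakly redundant with the other raters.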

Abstract

Selecting high-quality data can improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques or single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Extensive experiments show that FIRE outperforms other data selection methods and significantly boosts pretrained model performance across a wide range of downstream tasks, while requiring less than 37.5% of the training data needed by the Random baseline to reach the target performance.
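The idea of aligning raters with different score scales into one space can be illustrated with a simple empirical percentile mapping. This is only a stand-in: the paper's actual alignment procedure is not reproduced here, and the function name `percentile_align` and the sample scores are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def percentile_align(scores: Sequence[float]) -> List[float]:
    """Map one rater's raw scores to empirical percentiles in (0, 1]."""
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_right counts how many scores are <= s, so dividing by n
    # gives the fraction of data points that s ties with or beats.
    return [bisect_right(ordered, s) / n for s in scores]


# Two hypothetical raters scoring the same four documents on different scales.
rater_a = [0.1, 0.7, 0.3, 0.9]       # scores in [0, 1]
rater_b = [12.0, 55.0, 31.0, 80.0]   # scores in [0, 100]
aligned_a = percentile_align(rater_a)
aligned_b = percentile_align(rater_b)
```

Because both raters rank the four documents in the same order, the two aligned vectors coincide even though the raw scales differ, which is the property a unified rating space needs.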

Paper Structure

This paper contains 62 sections, 1 theorem, 27 equations, 11 figures, 10 tables, 1 algorithm.

Key Result

Theorem 1

The overall orthogonality $o_i$ of a rater $R_i$ with other raters can be quantified as weighted degree centrality of the corresponding vertex $V_i$ in the orthogonality graph.
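As a minimal sketch of this statement, the weighted degree centrality of a vertex can be computed as the sum of the weights on its incident edges. The code below assumes the orthogonality graph is given as a symmetric matrix of pairwise orthogonality weights; how those weights are derived, whether the centrality is normalized, the function name, and the example matrix are assumptions for illustration only.

```python
from typing import List, Sequence


def weighted_degree_centrality(weights: Sequence[Sequence[float]]) -> List[float]:
    """Return, for each vertex V_i, the sum of its incident edge weights.

    weights[i][j] is the (assumed symmetric) orthogonality between
    raters R_i and R_j; diagonal entries are ignored.
    """
    n = len(weights)
    if any(len(row) != n for row in weights):
        raise ValueError("weights must be a square matrix")
    return [
        sum(weights[i][j] for j in range(n) if j != i)
        for i in range(n)
    ]


# Hypothetical pairwise orthogonality among three raters.
W = [
    [0.0, 0.8, 0.6],
    [0.8, 0.0, 0.2],
    [0.6, 0.2, 0.0],
]
o = weighted_degree_centrality(W)
```

In this toy graph the first rater has the largest value, meaning it is the least redundant with the other two under the assumed weights.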

Figures (11)

  • Figure 1: Downstream accuracy with respect to pretraining tokens for Random, FIRE, and FIRE Progressive.
  • Figure 2: Overall framework of FIRE, which contains two processes: (a) Rating Alignment and (b) Rater Integration.
  • Figure 3: Ablation experiments on the impact of different rating integration strategies in FIRE.
  • Figure 4: The in-context learning results with respect to pretraining tokens on four downstream tasks: ARC-E, ARC-C, SciQ, and HellaSwag.
  • Figure 5: The impact of the partition multiplier factor $\beta$ on FIRE Progressive performance. The red and orange dashed lines respectively represent the scores of FIRE and Random.
  • ...and 6 more figures

Theorems & Definitions (4)

  • Definition 1: Orthogonality Graph
  • Theorem 1
  • Proof
  • Proof