FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
Liangyu Xu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Rongxiang Weng, Jingang Wang, Xunliang Cai
TL;DR
FIRE is a flexible framework for integrating multiple data quality raters when selecting pretraining data for large language models. It aligns diverse rating signals into a unified space via win-rate-percentile mappings, then fuses them according to each rater's intrinsic reliability and its orthogonality to the other raters, measured with an orthogonality graph. The result is an integrated quality signal $I(x) = \mathbf{A}(x)^T (\mathbf{o} \odot \bm{\gamma})$. A progressive data selection scheme refines the high-quality subset by re-evaluating orthogonality within successively smaller segments. In experiments on SlimPajama with 1.3B and 3B parameter models, FIRE outperforms single-rater and baseline data selection methods, improving average downstream-task performance by up to 2.9% while needing less than 37.5% of the training data that the Random baseline requires to reach the target performance. The multi-rater approach is scalable and data-efficient: performance improves as more raters are integrated and when progressive selection is applied.
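To make the integrated signal in the TL;DR concrete, here is a minimal sketch of the weighted fusion $I(x) = \mathbf{A}(x)^T (\mathbf{o} \odot \bm{\gamma})$. It rests on several assumptions that the summary does not spell out: $\mathbf{A}(x)$ is read as the vector of aligned ratings for one data point, $\mathbf{o}$ as per-rater orthogonality weights, and $\bm{\gamma}$ as per-rater reliability weights. A plain empirical rank percentile stands in for the win-rate-percentile mapping, and the weight values are made up for illustration. The names `percentile_ranks`, `integrated_signal`, `rater_a`, and `rater_b` are hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(scores):
    """Map raw rater scores to empirical percentiles in (0, 1].

    Stand-in (assumption) for the paper's win-rate-percentile alignment:
    it only puts raters with different score scales on a common scale.
    """
    ordered = sorted(scores)
    n = len(ordered)
    return [bisect_right(ordered, s) / n for s in scores]


def integrated_signal(aligned, o, gamma):
    """Compute I(x) = A(x)^T (o * gamma) for one data point.

    aligned: aligned ratings of x, one entry per rater (assumed A(x))
    o:       per-rater orthogonality weights (assumed meaning)
    gamma:   per-rater reliability weights (assumed meaning)
    """
    if not (len(aligned) == len(o) == len(gamma)):
        raise ValueError("aligned, o and gamma need one entry per rater")
    return sum(a * oi * gi for a, oi, gi in zip(aligned, o, gamma))


# Toy raw scores from two hypothetical raters over four documents.
raw = {
    "rater_a": [0.1, 0.7, 0.4, 0.9],
    "rater_b": [3.0, 1.0, 4.0, 2.0],
}

# Align each rater separately, then transpose to one row per document.
aligned_cols = [percentile_ranks(v) for v in raw.values()]
aligned_rows = list(zip(*aligned_cols))

o = [1.0, 0.5]      # illustrative orthogonality weights
gamma = [0.8, 0.6]  # illustrative reliability weights

scores = [integrated_signal(row, o, gamma) for row in aligned_rows]
best = max(range(len(scores)), key=lambda i: scores[i])
```

With these toy numbers the last document (index 3) receives the highest integrated score, because the more heavily weighted rater ranks it first. In a selection pipeline the top-scoring fraction of documents would then be kept for pretraining.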
Abstract
Selecting high-quality data can improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques or single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows data quality to be assessed across various dimensions. FIRE aligns multiple quality signals into a unified space and integrates diverse data quality raters to provide a single quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Extensive experiments show that FIRE outperforms other data selection methods and significantly boosts pretrained model performance across a wide range of downstream tasks, while requiring less than 37.5\% of the training data needed by the Random baseline to reach the target performance.
