Selection of LLM Fine-Tuning Data based on Orthogonal Rules
Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu
TL;DR
This work tackles data quality in LLM training by introducing an automated, rule-based data selection framework that promotes rule diversity through an orthogonality metric and DPP-based rule selection. By generating a broad pool of rules with GPT-4 and scoring data with task-aware rule vectors, the approach minimizes redundancy and tailors data quality signals to the target task. Across IMDB, Medical, Math, and Code, the authors show that rule diversity correlates with rating fidelity and that DPP-selected rule subsets yield superior data for downstream fine-tuning compared to strong baselines. The results demonstrate robust gains in domain-specific tasks and offer a general, scalable pipeline for high-quality data curation in LLM training and RLHF contexts.
Abstract
High-quality training data is critical to the performance of large language models (LLMs). Recent work has explored using LLMs to rate and select data based on a small set of human-designed criteria (rules), but these approaches often rely heavily on heuristics, lack principled metrics for rule evaluation, and generalize poorly to new tasks. We propose a novel rule-based data selection framework that introduces a metric based on the orthogonality of rule score vectors to evaluate and select complementary rules. Our automated pipeline first uses LLMs to generate diverse rules covering multiple aspects of data quality, then rates samples according to these rules and applies a determinantal point process (DPP) to select the most independent rules. These rules are then used to score the full dataset, and high-scoring samples are selected for downstream tasks such as LLM fine-tuning. We evaluate our framework in two experimental setups: (1) alignment with ground-truth ratings and (2) performance of LLMs fine-tuned on the selected data. Experiments across the IMDB, Medical, Math, and Code domains demonstrate that our DPP-based rule selection consistently improves both rating accuracy and downstream model performance over strong baselines.
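
The rule-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each rule is represented by a vector of per-sample scores, builds a cosine-similarity kernel over those vectors, and uses greedy determinant maximization as a simple stand-in for DPP-based selection. The function names (`gram_matrix`, `greedy_select_rules`, `score_samples`), the toy scores, and the averaging used to combine rule scores are all hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Scale a score vector to unit length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def gram_matrix(score_vectors):
    """Cosine-similarity kernel between rule score vectors (assumed kernel)."""
    unit = [normalize(v) for v in score_vectors]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def determinant(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_select_rules(score_vectors, k):
    """Greedily add the rule that maximizes the kernel sub-determinant.

    A larger determinant means the chosen score vectors are closer to
    mutually orthogonal, i.e. the rules are less redundant. With unit
    vectors every single rule has determinant ~1, so the first pick is
    essentially arbitrary. Assumes k <= number of rules.
    """
    kernel = gram_matrix(score_vectors)
    selected = []
    remaining = list(range(len(score_vectors)))
    for _ in range(k):
        best, best_det = None, -1.0
        for j in remaining:
            idx = selected + [j]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = j, d
        selected.append(best)
        remaining.remove(best)
    return selected


def score_samples(score_vectors, selected):
    """Average the selected rules' scores per sample (assumed aggregation)."""
    n_samples = len(score_vectors[0])
    return [
        sum(score_vectors[r][i] for r in selected) / len(selected)
        for i in range(n_samples)
    ]


# Toy data: rows are rules, columns are samples (hypothetical scores).
rule_scores = [
    [1.0, 0.0, 1.0, 0.0],  # rule 0
    [0.9, 0.1, 1.0, 0.0],  # rule 1, nearly redundant with rule 0
    [0.0, 1.0, 0.0, 1.0],  # rule 2, orthogonal to rule 0
]
selected = greedy_select_rules(rule_scores, k=2)
sample_scores = score_samples(rule_scores, selected)
```

In this toy setting, rules 0 and 1 rate the samples almost identically, so the greedy step should avoid picking both and should pair one of them with rule 2. A sampling-based DPP, or a kernel that also weights per-rule quality, would be natural alternatives to the greedy choice used here.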
