Selection of LLM Fine-Tuning Data based on Orthogonal Rules

Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu

TL;DR

This work tackles data quality in LLM training by introducing an automated, rule-based data selection framework that promotes rule diversity through an orthogonality metric and DPP-based rule selection. By generating a broad pool of rules with GPT-4 and scoring data with task-aware rule vectors, the approach minimizes redundancy and tailors data quality signals to the target task. Across IMDB, Medical, Math, and Code, the authors show that rule diversity correlates with rating fidelity and that DPP-selected rule subsets yield superior data for downstream fine-tuning compared to strong baselines. The results demonstrate robust gains in domain-specific tasks and offer a general, scalable pipeline for high-quality data curation in LLM training and RLHF contexts.

Abstract

High-quality training data is critical to the performance of large language models (LLMs). Recent work has explored using LLMs to rate and select data based on a small set of human-designed criteria (rules), but these approaches often rely heavily on heuristics, lack principled metrics for rule evaluation, and generalize poorly to new tasks. We propose a novel rule-based data selection framework that introduces a metric based on the orthogonality of rule score vectors to evaluate and select complementary rules. Our automated pipeline first uses LLMs to generate diverse rules covering multiple aspects of data quality, then rates samples according to these rules and applies the determinantal point process (DPP) to select the most independent rules. These rules are then used to score the full dataset, and high-scoring samples are selected for downstream tasks such as LLM fine-tuning. We evaluate our framework in two experiment setups: (1) alignment with ground-truth ratings and (2) performance of LLMs fine-tuned on the selected data. Experiments across IMDB, Medical, Math, and Code domains demonstrate that our DPP-based rule selection consistently improves both rating accuracy and downstream model performance over strong baselines.
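As a rough illustration of the selection step described in the abstract, the following Python sketch builds a similarity kernel from rule score vectors, greedily picks a low-redundancy subset of rules, and keeps the highest-scoring samples. It is a minimal sketch under assumptions that are not taken from the paper: the kernel is the Pearson correlation between rule score columns, greedy log-determinant maximization stands in for DPP sampling, and a sample's quality is the plain mean of its selected rule scores. The function names `rule_kernel`, `greedy_dpp_select`, and `select_top_samples` are hypothetical.

```python
import numpy as np


def rule_kernel(scores):
    """Similarity kernel between rule score vectors.

    `scores` has shape (n_samples, n_rules); column j holds rule j's ratings.
    Assumption: similarity is the Pearson correlation between columns.
    """
    centered = scores - scores.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=0, keepdims=True)
    unit = centered / np.clip(norms, 1e-12, None)  # guard against zero variance
    return unit.T @ unit


def greedy_dpp_select(kernel, r):
    """Greedily pick up to r rule indices with a large kernel sub-determinant.

    This is a greedy stand-in for DPP sampling: a larger determinant means
    the selected score vectors are closer to mutually orthogonal.
    """
    selected = []
    remaining = list(range(kernel.shape[0]))
    for _ in range(r):
        best_idx, best_logdet = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = j, logdet
        if best_idx is None:  # every remaining rule is linearly dependent
            break
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


def select_top_samples(scores, rule_idx, k):
    """Return indices of the k samples with the highest mean selected-rule score."""
    quality = scores[:, rule_idx].mean(axis=1)
    return np.argsort(-quality)[:k]
```

With a score matrix in which two rules are near-duplicates and a third is unrelated, `greedy_dpp_select(rule_kernel(scores), 2)` would be expected to keep the unrelated rule and only one of the duplicates, which is the redundancy-avoiding behavior the framework aims for.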

Paper Structure

This paper contains 41 sections, 3 theorems, 41 equations, 13 figures, and 21 tables.

Key Result

Theorem 1

Let ${\bm{C}} \in \mathbb{R}^{r \times r}$ be the true correlation matrix among the $r$ rules, and assume each candidate rule has nontrivial variance ($\Sigma_{j,j} \geq \sigma_{\min}^2 > 0$ for some constant $\sigma_{\min}$). We draw $n$ i.i.d. samples $\{{\bm{x}}^{(k)}\}_{1\leq k \leq n}$, where each ${\bm{x}}^{(k)} \in [0,1]^r$ represents the ratings of the $k$-th sample under the $r$ rules.
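The setup above (ratings in $[0,1]^r$, a true correlation matrix, and $n$ i.i.d. samples) can be explored numerically. The sketch below is not the theorem or its proof; it only measures how far a sample correlation matrix falls from a known true one, using the max entry-wise and Frobenius-norm deviations named in the lemmas listed further down. The toy rating distribution, with $r = 3$ rules built from independent uniform variables, and the names `sample_ratings`, `TRUE_CORR`, and `correlation_deviation` are assumptions made for illustration.

```python
import numpy as np


def sample_ratings(n, rng):
    """Draw n rating vectors in [0, 1]^3 from a toy distribution.

    Rule 0 and rule 2 are independent uniforms; rule 1 averages rule 0 with
    another independent uniform, so it is correlated with rule 0 only.
    """
    u = rng.uniform(0.0, 1.0, size=(n, 3))
    return np.column_stack([u[:, 0], 0.5 * (u[:, 0] + u[:, 1]), u[:, 2]])


# True correlation matrix of the toy distribution:
# corr(rule 0, rule 1) = 1/sqrt(2), all other off-diagonal entries are 0.
TRUE_CORR = np.array([
    [1.0, 1.0 / np.sqrt(2.0), 0.0],
    [1.0 / np.sqrt(2.0), 1.0, 0.0],
    [0.0, 0.0, 1.0],
])


def correlation_deviation(n, seed=0):
    """Return (max entry-wise, Frobenius) deviation of the sample correlation."""
    rng = np.random.default_rng(seed)
    x = sample_ratings(n, rng)
    c_hat = np.corrcoef(x, rowvar=False)  # columns are rules
    diff = c_hat - TRUE_CORR
    return np.abs(diff).max(), np.linalg.norm(diff, "fro")
```

Calling `correlation_deviation(n)` for increasing `n` gives a feel for how quickly the estimated rule correlations stabilize, which is the kind of question a finite-sample analysis of this setup addresses.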

Figures (13)

  • Figure 1: Pipeline for rule-based data rating and selection in five steps (detailed in Section \ref{subsec:Algo}).
  • Figure 2: (a) and (b): Pearson correlation between the rule correlation $\rho(\bar{{\bm{S}}})$ and the MSE $\epsilon(\bar{{\bm{S}}})$, using the Llama3-8B and Llama3-70B raters respectively. (c) and (d): Distribution of MSE over $10^6$ possible rule subsets of size $r$, using the Llama3-8B and Llama3-70B raters respectively; the two vertical lines mark the MSE values of QuRating and NoRule.
  • Figure 3: (a) Winning rate of DPP-selected rules against QuRating’s four rules and the NoRule setting, based on MSE across 100 DPP trials. (b) Rule correlation $\rho$ of DPP-selected versus randomly selected rules, averaged across 100 trials. (c) MSE of DPP-selected versus randomly selected rules, averaged across 100 trials. Plots (a), (b), and (c) show results for the Llama3-8B rater; (d), (e), and (f) show the corresponding results for the Llama3-70B rater.
  • Figure 4: Embeddings of rules generated by GPT and Claude across different domains.
  • Figure 5: Each row corresponds to a domain: (top) Medical, (middle) Math, and (bottom) Code. (a): Pearson correlation between rule correlation $\rho(\bar{{\bm{S}}})$ and MSE $\epsilon(\bar{{\bm{S}}})$. (b): Distribution of MSE from $10^6$ possible rule subsets with size $r$, where vertical lines represent QuRating and NoRule. (c): Winning rate of DPP-selected rule subsets compared to QuRating and NoRule baselines based on MSE. (d,e): Comparison of DPP-selected versus randomly selected rules on rule correlation $\rho(\bar{{\bm{S}}})$ and MSE $\epsilon(\bar{{\bm{S}}})$, respectively.
  • ...and 8 more figures

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Lemma 1: Max Entry-wise Deviation
  • proof
  • Lemma 2: Frobenius-Norm Deviation
  • proof