ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

Yan Yu, Yilun Liu, Minggui He, Shimin Tao, Weibin Meng, Xinhua Yang, Li Zhang, Hongxia Ma, Dengye Li, Daimeng Wei, Boxing Chen, Fuliang Li

TL;DR

The paper addresses the unreliability of LLM-based pairwise evaluation caused by non-transitive preferences. It introduces ELSPR, a graph-theoretic framework that models evaluator judgments as tournament graphs, detects non-transitivity via strongly connected components (SCCs), and measures preference clarity with a two-dimensional directed-graph structural entropy. By reconstructing SCCs into directed acyclic graphs (DAGs), ELSPR filters out ambiguous data to produce a "Cleaned" training set. Evaluators fine-tuned on this set show lower non-transitivity, reduced entropy, and more robust model rankings, validated on AlpacaEval and MT-bench. Human studies corroborate that the discarded data are significantly more ambiguous, which supports data cleaning as a practical way to build more human-aligned evaluation systems. Overall, ELSPR indicates that training-data quality, analyzed through a graph-theoretic lens, is a critical lever for robust, consistent evaluation of open-ended LLM capabilities.

Abstract

Pairwise evaluation of large language models (LLMs) has become the dominant paradigm for benchmarking open-ended tasks, yet non-transitive preferences, where evaluators prefer A over B, B over C, but C over A, fundamentally undermine ranking reliability. We show that this critical issue stems largely from low-quality data that contains inherently ambiguous preference pairs. To address this challenge, we propose ELSPR, a principled graph-theoretic framework that models pairwise preferences as tournament graphs and systematically identifies problematic training data. ELSPR quantifies non-transitivity through strongly connected components (SCCs) analysis and measures overall preference clarity using a novel normalized directed graph structural entropy metric. Our filtering methodology selectively removes preference data that induce non-transitivity while preserving transitive preferences. Extensive experiments on the AlpacaEval benchmark demonstrate that models fine-tuned on ELSPR-filtered data achieve substantial improvements: a 13.8% reduction in non-transitivity, a 0.088 decrease in structural entropy, and significantly enhanced discriminative power in real-world evaluation systems. Human validation confirms that discarded data exhibit dramatically lower inter-annotator agreement (34.4% vs. 52.6%) and model-human consistency (51.2% vs. 80.6%) compared to cleaned data. These findings establish ELSPR as an effective data self-purification approach for developing more robust, consistent, and human-aligned LLM evaluation systems.
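To make the SCC idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each preference is stored as a `(winner, loser)` edge and uses Kosaraju's algorithm (a standard SCC method chosen here for illustration) to find groups of responses that contain a preference cycle. The function name, the edge convention, and the toy data are all assumptions.

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm. `edges` holds (winner, loser) pairs (assumed convention)."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: iterative DFS on the forward graph, recording finish order.
    seen, order = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, it = stack[-1]
            nxt = next((w for w in it if w not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(fwd[nxt])))

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        comp, stack = [], [root]
        while stack:
            node = stack.pop()
            comp.append(node)
            for w in rev[node]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        components.append(comp)
    return components

# Toy tournament: A > B, B > C, C > A form a cycle; all three beat D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "A"),
         ("A", "D"), ("B", "D"), ("C", "D")]

components = strongly_connected_components(nodes, edges)
# An SCC with more than one node contains at least one preference cycle.
cyclic = [sorted(c) for c in components if len(c) > 1]
print(cyclic)
```

In this toy case the only multi-node component is the A, B, C cycle, while D sits in its own component because every comparison involving D is consistent.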

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 13 tables, 1 algorithm.

Figures (5)

  • Figure 1: Non-Transitive Preferences in LLM-as-a-Judge for Pairwise Comparisons (e.g., $A \succ B$, $B \succ C$, $C \succ A$).
  • Figure 2: ELSPR (Evaluator LLM training data Self-Purification on non-transitive Preferences via tournament graph Reconstruction) framework overview. (a) Raw preference data are collected via pairwise comparisons conducted by an advanced LLM. (b) The core analysis and filtering process: the raw data are first modeled as a tournament graph to identify cycles within SCCs. These cycles are then broken by reconstructing each SCC into a DAG based on in-degree ranking. The final global DAG serves as a rule to filter the initial raw data, separating it into a cleaned, transitively consistent training set and a discarded set of non-transitive preferences.
  • Figure 3: Cases of High and Low Structural Entropy in Preference Tournaments.
  • Figure 4: Comparison of data volumes between "Raw" and "Cleaned" training sets across datasets. The "Cleaned" training set's volume is approximately 80% of the "Raw" training set for each dataset.
  • Figure 5: Comparison of Data Volumes Between "Raw" and "Cleaned" Training Sets Across Different Datasets (Using the CoT Comparison (Tie Allowed) Prompt Template).
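
The reconstruction step described in the Figure 2 caption (breaking the cycles in an SCC by in-degree ranking, then using the result to filter the raw data) could be sketched as follows. This is a hedged illustration rather than the paper's algorithm. It assumes edges are `(winner, loser)` pairs, so in-degree counts losses inside the SCC, and it assumes that a pair whose two nodes have equal in-degree is simply discarded, since the text here does not say how such ties are handled. The helper name `rebuild_scc_as_dag` is hypothetical.

```python
from collections import Counter

def rebuild_scc_as_dag(scc_nodes, edges):
    """Split the edges inside one SCC into those that agree with an
    in-degree ranking (kept) and those that do not (dropped).

    Assumptions: edges are (winner, loser); fewer losses means a higher
    rank; pairs with equal in-degree are treated as ambiguous and dropped.
    """
    members = set(scc_nodes)
    inner = [(u, v) for u, v in edges if u in members and v in members]
    indeg = Counter(v for _, v in inner)  # losses counted inside the SCC only
    kept = [(u, v) for u, v in inner if indeg[u] < indeg[v]]
    dropped = [e for e in inner if e not in kept]
    return kept, dropped

# Toy SCC on four responses; every node can reach every other node.
scc = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "A"),
         ("A", "D"), ("D", "B"), ("C", "D")]

kept, dropped = rebuild_scc_as_dag(scc, edges)
print("cleaned:  ", kept)
print("discarded:", dropped)
```

Every kept edge points from a node with strictly fewer losses to one with strictly more, so the kept edges cannot form a cycle. Edges between different SCCs need no such treatment, because the condensation of a directed graph is already acyclic.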