TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning

Xiang Li; Yunshi Lan; Chao Yang

TL;DR

The paper tackles the problem of evaluating LLMs without relying on fixed benchmarks or single-turn judge assessments that risk data leakage and bias. It introduces TreeEval, a benchmark-free framework that uses a tree-planning strategy and an examiner/judge architecture to generate and evaluate interrelated questions under predefined topics. A memory-based eval controller guides topic sampling and question generation, while a score aggregator weights node-level outcomes to produce a robust final score. Empirical results across six open-source LLMs show TreeEval achieves high correlation with AlpacaEval2.0 using only about 45 questions, demonstrating efficiency and fine-grained discriminative power; analyses underscore robustness and potential for deeper evaluation when needed. Code for TreeEval is available at the cited GitHub repository.

Abstract

Recently, numerous new benchmarks have been established to evaluate the performance of large language models (LLMs), either by computing a holistic score or by employing another LLM as a judge. However, these approaches suffer from data leakage, because the benchmarks are openly accessible, and from an inflexible evaluation process. To address this issue, we introduce TreeEval, a benchmark-free evaluation method for LLMs that lets a high-performance LLM host an irreproducible evaluation session and thereby essentially avoids data leakage. Moreover, this LLM acts as an examiner that raises a series of questions under a topic using a tree planning strategy, which considers the current evaluation status when deciding the next question to generate and ensures the completeness and efficiency of the evaluation process. We evaluate 6 models of different parameter sizes, including 7B, 13B, and 33B, and achieve the highest correlation coefficient with AlpacaEval2.0 using only around 45 questions. We also conduct further analysis to show the robustness and reliability of TreeEval. Our code can be accessed at https://github.com/Ashura5/TreeEval.
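
The tree planning idea described above can be pictured with a small sketch. This is not the authors' implementation: the `Node` fields, the rule "expand a node only while the judged outcome is inconclusive", the `tie_margin` and `max_depth` parameters, and the depth-based weighting in `aggregate` are all assumptions made for illustration, and the examiner, judge, and subtopic generator are passed in as plain callables rather than real LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Node:
    """One question in the evaluation tree (field names are hypothetical)."""
    topic: str
    question: str
    score: float  # assumed judge outcome in [-1, 1]; 0 means a tie
    depth: int
    children: List["Node"] = field(default_factory=list)


def build_tree(
    topic: str,
    ask: Callable[[str], str],                   # stand-in for the examiner
    judge: Callable[[str], float],               # stand-in for the judge
    subtopics: Callable[[str, str], List[str]],  # stand-in for follow-up topic sampling
    depth: int = 0,
    max_depth: int = 2,
    tie_margin: float = 0.5,
) -> Node:
    """Ask one question, then go deeper only if the result is inconclusive."""
    question = ask(topic)
    node = Node(topic, question, judge(question), depth)
    # Assumed expansion rule: a near-tie means the topic needs finer questions.
    if abs(node.score) < tie_margin and depth < max_depth:
        for sub in subtopics(topic, question):
            node.children.append(
                build_tree(sub, ask, judge, subtopics, depth + 1, max_depth, tie_margin)
            )
    return node


def iter_nodes(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def aggregate(root: Node) -> float:
    """Weighted mean of node scores; the 1/(depth+1) weight is an assumption."""
    nodes = list(iter_nodes(root))
    weights = [1.0 / (n.depth + 1) for n in nodes]
    return sum(w * n.score for w, n in zip(weights, nodes)) / sum(weights)
```

In this sketch the number of questions is not fixed in advance: a decisive answer at the root ends the session for that topic after one question, while a tie triggers follow-up questions on subtopics, which is one way to read "considers the current evaluation status to decide the next question generation".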

Paper Structure (26 sections, 4 equations, 6 figures, 5 tables)

Introduction
Related Work
Methods of LLM Evaluation
Data Leakage of LLM Evaluation
Methodology
Overall Architecture
TreeEval Modules
Score Aggregator
Experiments
Experimental Setup
Performance of TreeEval
Further Analysis
Conclusions
Limitations
Ethical Considerations
...and 11 more sections

Figures (6)

Figure 1: Comparison of TreeEval with existing evaluation paradigms.
Figure 2: TreeEval system with an illustrative tree for evaluation. The left section shows the components of TreeEval and their workflow. The right section displays a tree constructed for evaluation within the topic Technology and Communication (leaf nodes are shown in red boxes), where each node denotes a question annotated with its topic and evaluation score. The generated questions of the tree are shown in the appendix (Eval Controller Example).
Figure 3: Radar chart illustrating the scores of various LLMs under different pre-defined topics.
Figure 4: Results of re-running TreeEval 5 times for various LLMs.
Figure 5: Examples of the evaluation process for two pairs of LLMs under the topic "Business and Finance", shown as two colored trees. The detailed contents of a node are displayed in a dashed box, and the recognized entities used for follow-up topics are shown in red font.
...and 1 more figures
