JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation

Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Shengjia Ma, Yinghan Shen, Zixuan Li, Jian Guo, Yuanzhuo Wang

TL;DR

JudgeAgent addresses the limitations of static benchmarks by introducing a knowledge-driven dynamic evaluation framework for LLMs. It uses context-graph–driven traversal to broaden knowledge coverage and an interactive, multi-turn interview with difficulty-adaptive question generation to align evaluation with actual capabilities. Empirical results on MedQA, MultiHop-RAG, and QuALITY show JudgeAgent provides more precise assessments and actionable feedback for model iteration, while demonstrating resilience to data contamination. The work highlights a practical path toward more robust, knowledge-driven LLM evaluation.

Abstract

Current evaluation methods for large language models (LLMs) rely primarily on static benchmarks, which present two major challenges: limited knowledge coverage and fixed difficulty levels that do not match the capabilities of the evaluated LLMs. These limitations lead to superficial assessments of LLM knowledge and impede targeted model optimization. To bridge this gap, we propose JudgeAgent, a knowledge-driven and dynamic evaluation framework for LLMs. To address limited knowledge coverage, JudgeAgent uses LLM agents equipped with context graphs to traverse knowledge structures systematically for question generation. To mitigate data contamination and difficulty mismatch, it adopts a difficulty-adaptive, multi-turn interview mechanism. As a result, JudgeAgent can deliver comprehensive evaluations and support more effective improvement of LLMs. Empirical results demonstrate that JudgeAgent enables more comprehensive evaluations and facilitates effective model iteration, highlighting the potential of this knowledge-driven and dynamic evaluation paradigm. The source code is available at https://github.com/DataArcTech/JudgeAgent.
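The two mechanisms named in the abstract, graph traversal for coverage and a difficulty-adaptive multi-turn interview, can be illustrated with a small Python sketch. This is not the authors' implementation: the toy graph, the three difficulty labels, the breadth-first traversal, the one-step-up/one-step-down adaptation rule, and every function name (`traverse`, `adapt`, `interview`, `answer_fn`) are assumptions made purely for illustration. The actual JudgeAgent code is in the repository linked above.

```python
from collections import deque

# Hypothetical toy context graph: nodes are knowledge entities,
# edges point to related entities. Not taken from the paper.
CONTEXT_GRAPH = {
    "aspirin": ["COX inhibition", "bleeding risk"],
    "COX inhibition": ["prostaglandins"],
    "bleeding risk": [],
    "prostaglandins": [],
}

# Assumed difficulty scale; the paper's actual levels may differ.
DIFFICULTIES = ["easy", "medium", "hard"]


def traverse(graph, seed, max_nodes):
    """Breadth-first walk from a seed entity; returns entities in visit order."""
    seen, order, queue = {seed}, [], deque([seed])
    while queue and len(order) < max_nodes:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


def adapt(level, correct):
    """Assumed rule: one step harder after a correct answer, one step easier
    after a wrong one, clamped to the valid range of difficulty indices."""
    step = 1 if correct else -1
    return min(max(level + step, 0), len(DIFFICULTIES) - 1)


def interview(graph, seed, answer_fn, max_turns=4):
    """Multi-turn interview over the traversed entities.

    answer_fn(entity, difficulty) -> bool is a stand-in for asking the target
    LLM a generated question and grading its answer.
    """
    level, transcript = 0, []
    for entity in traverse(graph, seed, max_turns):
        difficulty = DIFFICULTIES[level]
        correct = answer_fn(entity, difficulty)
        transcript.append((entity, difficulty, correct))
        level = adapt(level, correct)
    return transcript


if __name__ == "__main__":
    # Simulated target model that fails only on "hard" questions.
    for turn in interview(CONTEXT_GRAPH, "aspirin", lambda e, d: d != "hard"):
        print(turn)
```

In this sketch the difficulty settles around the level at which the simulated model starts failing, which is the intuition behind matching question difficulty to the capabilities of the evaluated model rather than fixing it in advance.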

Paper Structure

This paper contains 33 sections, 6 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: The difference between JudgeAgent and current evaluation paradigms.
  • Figure 2: The framework of JudgeAgent. The left part is the interaction process. The central part is the composition of JudgeAgent. The right part presents the tools of JudgeAgent.
  • Figure 3: Results of the ablation study with MedQA as the base dataset. All values are percentages.
  • Figure 4: A brief overview of the comparative case in the Case Study.
  • Figure 5: The results of different expansion rounds on MedQA and MultiHop-RAG. @K indicates the ACC improvement after the K-th interaction.
  • ...and 3 more figures