JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation
Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Shengjia Ma, Yinghan Shen, Zixuan Li, Jian Guo, Yuanzhuo Wang
TL;DR
JudgeAgent addresses the limitations of static benchmarks by introducing a knowledge-driven, dynamic evaluation framework for LLMs. It uses context-graph-driven traversal to broaden knowledge coverage, and an interactive, multi-turn interview with difficulty-adaptive question generation to match question difficulty to the evaluated model's actual capabilities. Empirical results on MedQA, MultiHop-RAG, and QuALITY show that JudgeAgent provides more precise assessments and actionable feedback for model iteration, while remaining resilient to data contamination. The work highlights a practical path toward more robust, knowledge-driven LLM evaluation.
Abstract
Current evaluation methods for large language models (LLMs) rely primarily on static benchmarks, which pose two major challenges: limited knowledge coverage and a fixed difficulty level that may not match the capabilities of the evaluated LLM. These limitations lead to superficial assessments of LLM knowledge and thereby impede targeted model optimization. To bridge this gap, we propose JudgeAgent, a knowledge-driven and dynamic evaluation framework for LLMs. To address limited knowledge coverage, JudgeAgent uses LLM agents equipped with context graphs to systematically traverse knowledge structures for question generation. To mitigate data contamination and difficulty mismatch, it adopts a difficulty-adaptive, multi-turn interview mechanism. Empirical results demonstrate that JudgeAgent enables more comprehensive evaluations and facilitates effective model iteration, highlighting the potential of this knowledge-driven and dynamic evaluation paradigm. The source code is available at https://github.com/DataArcTech/JudgeAgent.
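To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy context graph stored as an adjacency dictionary, a breadth-first traversal to pick topics, and a simple "step up when correct, step down when wrong" difficulty rule. All names here (`GRAPH`, `traverse`, `next_difficulty`, `interview`, `answer_fn`) are hypothetical, and the actual traversal and adaptation strategies in JudgeAgent may differ.

```python
from collections import deque

# Hypothetical toy context graph: nodes are knowledge snippets,
# edges link related snippets. Not taken from the paper.
GRAPH = {
    "aspirin": ["cox_inhibition", "bleeding_risk"],
    "cox_inhibition": ["prostaglandins"],
    "bleeding_risk": [],
    "prostaglandins": [],
}


def traverse(graph, start, max_nodes):
    """Breadth-first walk returning up to max_nodes distinct nodes."""
    seen, order, queue = {start}, [], deque([start])
    while queue and len(order) < max_nodes:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


def next_difficulty(level, was_correct, lo=1, hi=3):
    """Raise difficulty after a correct answer, lower it after a wrong one.

    The result is clamped to the range [lo, hi]. This rule is an
    assumption chosen for illustration.
    """
    step = 1 if was_correct else -1
    return max(lo, min(hi, level + step))


def interview(graph, start, answer_fn, turns=3, level=2):
    """Run a multi-turn interview over topics reached from `start`.

    answer_fn(topic, level) -> bool is a stand-in for querying the
    evaluated LLM and grading its answer.
    """
    log = []
    for topic in traverse(graph, start, turns):
        correct = answer_fn(topic, level)
        log.append((topic, level, correct))
        level = next_difficulty(level, correct)
    return log
```

As a usage example, a stand-in model that answers correctly only at difficulty 2 or below can be simulated with `interview(GRAPH, "aspirin", lambda topic, level: level <= 2)`. The returned log records, for each turn, the topic, the difficulty asked, and whether the answer was judged correct, so the difficulty oscillates around the level the stand-in model can handle.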
