Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang
TL;DR
This work addresses the inadequacy of static benchmarks for rapidly evolving LLMs by introducing Benchmark Self-Evolving, a dynamic evaluation framework that generates evolving instances through six reframing operations. A GPT-4-driven system of four agents (pre-filter, creator, verifier, and candidate option formulator) creates and validates new context-question-answer triplets, enabling scalable, robust, and fine-grained testing across four tasks. Experimental results show that the evolving benchmarks generally lower model performance relative to the original evaluations and widen both inter-model and intra-model disparities, giving a more nuanced view of capabilities and limitations. The framework also shows resilience to data contamination and provides insight into problem-solving sub-abilities, informing model selection and future improvements while acknowledging computational and ethical considerations.
Abstract
This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing them with high confidence into new evolving instances that dynamically extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances that test LLMs against diverse queries and data noise and probe their problem-solving sub-abilities. With this framework, we extend benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs against their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflects models' capabilities. In addition, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks (Code and data are available at https://github.com/NanshineLoong/Self-Evolving-Benchmark).
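
To make the multi-agent flow concrete, the following is a minimal sketch of how the four roles named above (pre-filter, creator, verifier, candidate option formulator) could be chained. It is an illustration under stated assumptions, not the released implementation: the names `Instance`, `evolve`, and the `toy_*` functions are hypothetical, the agents are plain Python stubs rather than GPT-4 calls, and the acceptance rule (the verifier must accept the new answer and reject a formulated wrong option) is one plausible wiring of the roles.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Instance:
    """A context-question-answer triplet (hypothetical container)."""
    context: str
    question: str
    answer: str


def evolve(
    originals: Iterable[Instance],
    pre_filter: Callable[[Instance], bool],
    creator: Callable[[Instance], Instance],
    formulate_wrong: Callable[[Instance], str],
    verifier: Callable[[Instance, str], bool],
) -> List[Instance]:
    """Chain the four roles; each callable stands in for one agent."""
    evolved = []
    for original in originals:
        # Pre-filter: skip originals judged unsuitable for reframing.
        if not pre_filter(original):
            continue
        # Creator: reframe the context or question into a new instance.
        candidate = creator(original)
        # Candidate option formulator: propose an incorrect answer.
        wrong = formulate_wrong(candidate)
        # Verifier (assumed rule): accept the new answer, reject the wrong one.
        if verifier(candidate, candidate.answer) and not verifier(candidate, wrong):
            evolved.append(candidate)
    return evolved


# Toy stand-ins so the sketch runs without any model calls (all assumptions).
def toy_pre_filter(inst: Instance) -> bool:
    return bool(inst.answer.strip())


def toy_creator(inst: Instance) -> Instance:
    return replace(inst, question=inst.question + " Answer briefly.")


def toy_formulate_wrong(inst: Instance) -> str:
    return "not " + inst.answer


def toy_verifier(inst: Instance, candidate_answer: str) -> bool:
    return candidate_answer == inst.answer


originals = [
    Instance("Tom has 3 apples and buys 2 more.", "How many apples does Tom have?", "5"),
    Instance("A context with no usable label.", "What is asked?", ""),
]
evolved = evolve(originals, toy_pre_filter, toy_creator, toy_formulate_wrong, toy_verifier)
print(len(evolved))  # prints 1: the unlabeled instance is filtered out
```

Passing the agents as callables keeps the control flow independent of any particular model API, so each stub could be swapped for a prompted LLM call without changing `evolve`.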
