Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang

TL;DR

This work addresses the inadequacy of static benchmarks for rapidly evolving LLMs by introducing Benchmark Self-Evolving, a dynamic evaluation framework that generates evolving instances through six reframing operations. A GPT-4-driven four-agent system (pre-filter, creator, verifier, candidate option formulator) creates and validates new context-question-answer triplets, enabling scalable, robust, and fine-grained testing across four tasks. Experiments show that the evolving benchmarks generally lower model performance relative to the original benchmarks and widen both inter-model and intra-model disparities, giving a more nuanced view of capabilities and limitations. The framework also shows resilience to data contamination and offers insight into sub-abilities, informing model selection and future improvements, while acknowledging computational and ethical considerations.

Abstract

This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence that dynamically extend existing benchmarks. Towards a more scalable, robust and fine-grained evaluation, we implement six reframing operations to construct evolving instances that test LLMs against diverse queries and data noise and probe their problem-solving sub-abilities. With this framework, we extend benchmark datasets of four tasks. Experimental results show a general performance decline in most LLMs against their original results. This decline under our scalable and robust evaluations, alongside our fine-grained evaluation, more accurately reflects models' capabilities. In addition, our framework widens performance discrepancies both between different models and within the same model across various tasks, facilitating more informed model selection for specific tasks. Code and data are available at https://github.com/NanshineLoong/Self-Evolving-Benchmark.
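
To make the multi-agent flow concrete, here is a minimal Python sketch of how the four agents named in the TL;DR (pre-filter, creator, verifier, candidate option formulator) might be chained. It is an illustration under stated assumptions, not the authors' implementation: the stage order, the `Instance` and `EvolvedInstance` types, the `evolve` function, and the toy stand-in agents are all hypothetical. In the paper the agents are driven by GPT-4, whereas here they are plain callables so the sketch runs without model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Instance:
    """A context-question-answer triplet (hypothetical type for this sketch)."""
    context: str
    question: str
    answer: str


@dataclass(frozen=True)
class EvolvedInstance:
    """An accepted evolving instance plus its candidate answer options."""
    instance: Instance
    options: List[str]


def evolve(
    original: Instance,
    pre_filter: Callable[[Instance], bool],
    creator: Callable[[Instance], Instance],
    verifier: Callable[[Instance], bool],
    option_formulator: Callable[[Instance], List[str]],
) -> Optional[EvolvedInstance]:
    """Run one original instance through the four stages.

    Returns None when the instance is dropped by the pre-filter or the
    verifier. The stage order is an assumption made for illustration.
    """
    if not pre_filter(original):
        return None
    candidate = creator(original)  # apply one reframing operation
    if not verifier(candidate):
        return None
    distractors = option_formulator(candidate)
    return EvolvedInstance(candidate, [candidate.answer, *distractors])


# Toy stand-ins for the agents (assumptions, not the paper's prompts).
def toy_pre_filter(inst: Instance) -> bool:
    return bool(inst.context and inst.question)


def toy_creator(inst: Instance) -> Instance:
    return Instance(inst.context, "Rephrased: " + inst.question, inst.answer)


def toy_verifier(inst: Instance) -> bool:
    return inst.answer in inst.context


def toy_options(inst: Instance) -> List[str]:
    return ["not " + inst.answer]


original = Instance(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
    answer="Paris",
)
evolved = evolve(original, toy_pre_filter, toy_creator, toy_verifier, toy_options)
```

Passing the agents in as callables keeps the control flow separate from any particular model backend, so each stage can be swapped or mocked independently.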

Paper Structure

This paper contains 29 sections, 2 equations, 6 figures, and 16 tables.

Figures (6)

  • Figure 1: The evolution of LLMs necessitates benchmark self-evolving.
  • Figure 2: The workflow of our Multi-Agent Evolving Instance Generator system.
  • Figure 3: Comparison of evolving results using various reframing operations versus original results. Darker bars show accuracy for each operation across all datasets, with lighter bars ahead representing original accuracy.
  • Figure 4: Results of fine-grained sub-ability evaluation.
  • Figure 5: Comparison of Llama-2-7B-Chat models under different contamination conditions. "Vanilla", "In-domain Cont." and "Direct Cont." denote the original model, the in-domain contaminated model, and the directly contaminated model, respectively.
  • ...and 1 more figure