MACEval: A Multi-Agent Continual Evaluation Network for Large Models

Zijian Chen, Yuze Sun, Yuan Tian, Wenjun Zhang, Guangtao Zhai

TL;DR

MACEval introduces a dynamic, autonomous multi-agent framework for evaluating large models, addressing data contamination, labor-intensive pipelines, and transient metrics. It models evaluation as an interview among interviewee, interviewer, and supervisor, employing in-process data generation and customizable evaluation topologies to probe multiple capabilities. An ACC-AUC–inspired metric and an overall evaluation-network energy quantify longitudinal performance across tasks, validated on 9 open-ended tasks and 23 models. Results indicate improved efficiency, scalability, and robustness to data contamination, highlighting a practical avenue for safer and more maintainable large-model evaluation.

Abstract

Hundreds of benchmarks dedicated to evaluating large models from multiple perspectives have been presented over the past few years. Despite substantial efforts, most of them remain closed-ended and are prone to overfitting due to potential data contamination in the ever-growing training corpora of large models, which undermines the credibility of the evaluation. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human-dependent curation procedure, pose significant challenges for timely maintenance and adaptation to gauge the advancing capabilities of large models. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define a new set of metrics to quantify performance longitudinally and sustainably. MACEval adopts an interactive and autonomous evaluation mode that employs role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 9 open-ended tasks with 23 participating large models demonstrate that MACEval is (1) human-free and automatic, mitigating laborious result processing through inter-agent judgment; (2) efficient and economical, requiring considerably less data and overhead than related benchmarks to obtain similar results; and (3) flexible and scalable, migrating or integrating existing benchmarks via customized evaluation topologies. We hope that MACEval can broaden future directions of large model evaluation.

Paper Structure

This paper contains 32 sections, 4 equations, 11 figures, 14 tables.

Figures (11)

  • Figure 1: Paradigm comparison between current large model evaluations and our proposed MACEval.
  • Figure 2: An overview of our proposed MACEval, which consists of three primary phases: evaluation capability determination, MAEN construction, and open-ended task selection. The pipeline models the evaluation of large models as a multi-round interview process. Specialized agents, such as interviewers for direct performance evaluation and third-party supervisors for validity assessment of the entire process, communicate through a message propagation mechanism, enabling collaboration between interviewer models and other functional models to efficiently and automatically produce reliable evaluation outputs.
  • Figure 3: A data card of 9 open-ended tasks that evaluate the visual perception, text comprehension, math, algorithm reasoning, and coding abilities of large models.
  • Figure 4: Example of an evaluation network, where colored clusters represent evaluation agents for different capabilities and red arrows denote activated evaluation routes.
  • Figure 5: Performance curves and ACC-AUC values of different series of MLLMs. The positions of the red circles indicate the upper bound of the model's capabilities. The upper and lower parts of the subfigure depict the IQP and CU tasks, respectively.
  • ...and 6 more figures
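
The ACC-AUC values mentioned in the Figure 5 caption summarize a whole performance curve as a single number. The metric's exact definition is not given in this section, so the following is only a minimal sketch under stated assumptions: accuracy is recorded at successive difficulty levels, and the score is the area under that curve computed with the trapezoidal rule and normalized by the range of levels. The function name `acc_auc` and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def acc_auc(levels: Sequence[float], accuracies: Sequence[float]) -> float:
    """Normalized area under an accuracy-vs-difficulty curve.

    Assumption (not from the paper): the area is computed with the
    trapezoidal rule and divided by the level range, so a model that
    stays at accuracy 1.0 on every level scores 1.0.
    """
    if len(levels) != len(accuracies):
        raise ValueError("levels and accuracies must have the same length")
    if len(levels) < 2:
        raise ValueError("at least two points are needed to form a curve")

    area = 0.0
    for i in range(1, len(levels)):
        width = levels[i] - levels[i - 1]
        if width <= 0:
            raise ValueError("levels must be strictly increasing")
        # Trapezoid between two consecutive difficulty levels.
        area += width * (accuracies[i] + accuracies[i - 1]) / 2.0

    return area / (levels[-1] - levels[0])


# Hypothetical curve: accuracy degrades as the difficulty level rises.
score = acc_auc([1, 2, 3, 4], [1.0, 1.0, 0.5, 0.0])
print(f"ACC-AUC (sketch): {score:.3f}")
```

Normalizing by the level range keeps scores comparable between curves that cover different numbers of levels. Whether the paper normalizes this way is an assumption.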