AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

TL;DR

AutoJudger introduces an agent-driven framework for efficient benchmarking of multimodal LLMs by integrating IRT-based question difficulty, an autonomous judging agent, semantic-aware retrieval, and a dynamic memory system. The method adaptively selects informative questions in real time, reducing evaluation cost while preserving model ranking fidelity: on MMT-Bench it achieves over 90% ranking consistency with only 4% of the full benchmark data. It combines offline difficulty estimation with online ability tracking, semantic diversity, and memory-guided decision-making to balance difficulty and coverage across modalities. Comprehensive experiments across four benchmarks with 17 MLLMs show strong performance and stability, suggesting that AutoJudger is a scalable, transparent solution for multimodal model evaluation.

Abstract

Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring effort. To address this cost, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses: on MMT-Bench, AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy relative to the full benchmark evaluation.
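
To make the IRT ingredient concrete, the following is a minimal sketch of classical adaptive testing under a one-parameter (Rasch) model. It is an illustration under stated assumptions, not AutoJudger's implementation: the abstract does not say which IRT variant is used, and AutoJudger selects questions with an agent rather than with the Fisher-information rule shown here. The names `p_correct`, `fisher_information`, `estimate_ability`, and `select_next` are hypothetical.

```python
import math

def p_correct(theta: float, b: float) -> float:
    """Rasch (1PL) probability that a model of ability theta answers
    a question of difficulty b correctly. Assumed IRT variant."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fisher_information(theta: float, b: float) -> float:
    """Information a question carries about theta; it peaks when b == theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def estimate_ability(difficulties, outcomes, steps=50, prior_var=4.0):
    """MAP ability estimate via damped Newton steps.
    A zero-mean Gaussian prior (an assumption) keeps the estimate finite
    when every answer so far is correct or every answer is wrong."""
    theta = 0.0
    for _ in range(steps):
        grad = sum(y - p_correct(theta, b)
                   for b, y in zip(difficulties, outcomes)) - theta / prior_var
        hess = -sum(fisher_information(theta, b)
                    for b in difficulties) - 1.0 / prior_var
        step = -grad / hess
        theta += max(-1.0, min(1.0, step))  # clamp each step for stability
    return theta

def select_next(theta, difficulties, asked):
    """Pick the unasked question with the highest Fisher information.
    Stand-in for the agent's choice, not the paper's selection rule."""
    candidates = [i for i in range(len(difficulties)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, difficulties[i]))
```

Under this model a question is most informative when its difficulty matches the current ability estimate, which is why adaptive selection can rank models from a small fraction of a benchmark.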

Paper Structure

This paper contains 52 sections, 14 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: Benchmark scale and efficiency of AutoJudger. (a) plots the scales of various benchmarks that are commonly adopted in MLLM evaluation. The triangle and pentagon markers indicate the number of samples required by AutoJudger to achieve 85% and 90% consistency with the full-set evaluation results, respectively. (b) compares several efficient benchmarking methods on MMT-Bench. AutoJudger achieves 92% rank consistency using only 4% of the data (125 samples).
  • Figure 2: The framework of AutoJudger. Before evaluation, the difficulties of the questions in a benchmark are computed using a set of offline models. At each evaluation iteration, AutoJudger first retrieves candidate questions based on the estimated ability. Then, AutoJudger selects the most suitable question, collects the response from the evaluated model, and updates its memory.
  • Figure 3: Evaluation performance and stability under varying compression ratios.
  • Figure 4: Ranking accuracy of AutoJudger computed on MMMU$_{\textit{DEV VAL}}$ under three distinct difficulty settings.
  • Figure 5: Comparison of ability-difficulty distance (bar chart, left y-axis) and semantic distance (line plot, right y-axis) with and without memory.
  • ...and 4 more figures
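
The iteration described in the Figure 2 caption (retrieve candidates near the estimated ability, select one, collect the response, update memory) can be sketched as a loop. This is a hedged illustration only: the `Memory` structure, the top-`k` retrieval by difficulty, the least-covered-category rule standing in for the agent, and the single gradient-step ability update are all assumptions, not details taken from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Memory:
    """Running statistics of asked questions (hypothetical structure)."""
    asked: List[int] = field(default_factory=list)
    # category -> [number asked, number answered correctly]
    per_category: Dict[str, List[int]] = field(default_factory=dict)

    def update(self, qid: int, category: str, correct: bool) -> None:
        self.asked.append(qid)
        stats = self.per_category.setdefault(category, [0, 0])
        stats[0] += 1
        stats[1] += int(correct)

def retrieve_candidates(theta, difficulties, memory, k=3):
    """Return the k unasked questions whose difficulty is closest to theta."""
    unseen = [i for i in range(len(difficulties)) if i not in memory.asked]
    return sorted(unseen, key=lambda i: abs(difficulties[i] - theta))[:k]

def choose_least_covered(candidates, categories, memory):
    """Stand-in for the agent: prefer the category asked least often so far."""
    return min(candidates,
               key=lambda i: memory.per_category.get(categories[i], [0, 0])[0])

def evaluate(answer_fn: Callable[[int], bool], difficulties, categories,
             budget: int, lr: float = 0.5):
    """Run the retrieve / select / respond / update loop for `budget` questions."""
    theta, memory = 0.0, Memory()
    for _ in range(min(budget, len(difficulties))):
        candidates = retrieve_candidates(theta, difficulties, memory)
        qid = choose_least_covered(candidates, categories, memory)
        correct = answer_fn(qid)
        # One gradient step on the Rasch log-likelihood (assumed update rule).
        p = 1.0 / (1.0 + math.exp(-(theta - difficulties[qid])))
        theta += lr * (int(correct) - p)
        memory.update(qid, categories[qid], correct)
    return theta, memory
```

A call such as `evaluate(lambda q: difficulties[q] <= 0, difficulties, categories, budget=4)` simulates a model that solves only the easier questions; the returned memory then shows which categories were covered within the budget.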