ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models

Zhelun Shi, Zhipin Wang, Hongxing Fan, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

TL;DR

ChEF introduces a modular, recipe-based evaluation framework for Multimodal Large Language Models, unifying Scenario, Instruction, Inferencer, and Metric into interchangeable components to enable scalable, fair comparisons. It formalizes six desiderata (Calibration, ICL, Instruction Following, Language Performance, Hallucination, Robustness) as evaluable Recipes and validates them through a large-scale study across 9 MLLMs and 9 Scenarios. The framework reveals persistent weaknesses in ICL, instruction following, and robustness, and shows strong links between desiderata and visual-task performance, while providing stability analyses and GPT-based language evaluation strategies. By offering a growing toolkit and standardization, ChEF aims to facilitate broader, more reliable benchmarking and faster progress in open-source MLLMs.

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content across myriad potential downstream tasks. However, although a number of benchmarks have been proposed, the capabilities and limitations of MLLMs are still not comprehensively understood, owing to the lack of a standardized and holistic evaluation framework. To this end, we present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs. First, we structure ChEF as four modular components, i.e., Scenario as scalable multimodal datasets, Instruction as flexible instruction-retrieving formulae, Inferencer as reliable question-answering strategies, and Metric as indicative task-specific score functions. Built on these components, ChEF facilitates versatile evaluations in a standardized framework, and new evaluations can be built by designing new Recipes (systematic selections of these four components). Notably, current MLLM benchmarks can be readily summarized as Recipes of ChEF. Second, we introduce 6 new Recipes to quantify the capabilities that competent MLLMs need in order to act as reliable agents performing real-world multimodal interactions (the desiderata, i.e., calibration, in-context learning, instruction following, language performance, hallucination, and robustness). Third, we conduct a large-scale evaluation of 9 prominent MLLMs on 9 scenarios and 6 desiderata. Our evaluation yields more than 20 valuable observations concerning the generalizability of MLLMs across various scenarios and the composite capability of MLLMs required for multimodal interactions. We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models, so that ChEF can be a growing evaluation framework for the MLLM community.
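
To make the four-component structure concrete, the following is a minimal sketch of a Recipe as a plain data structure. It is not the ChEF toolkit's API: the class name `Recipe`, its field names, and the `describe` helper are hypothetical and chosen only for illustration. The component values are the two Recipes named in the Figure 2 caption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Recipe:
    """Hypothetical container for one evaluation Recipe (not the ChEF API)."""
    scenario: str      # dataset the model is evaluated on
    instruction: str   # how the prompt is retrieved or built
    inferencer: str    # question-answering strategy
    metric: str        # task-specific score function

    def describe(self) -> str:
        # Render the Recipe in the {Scenario, Instruction, Inferencer, Metric} form.
        parts = (self.scenario, self.instruction, self.inferencer, self.metric)
        return "{" + ", ".join(parts) + "}"


# The two Recipes listed in the Figure 2 caption.
recipe_a = Recipe("Flickr30k", "ICE", "PPL", "Accuracy")
recipe_b = Recipe("VOC2012", "Query", "Multi-Turn", "Accuracy")

# A new evaluation is defined by swapping a single component.
variant = replace(recipe_a, inferencer="Multi-Turn")

print(recipe_a.describe())
print(variant.describe())
```

Treating a Recipe as an immutable record keeps each evaluation setting explicit, so two runs can be compared component by component.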

Paper Structure

This paper contains 43 sections, 5 equations, 26 figures, 13 tables.

Figures (26)

  • Figure 1: (a) ChEF Overview. (b) Current MLLM benchmarks can be readily absorbed into ChEF. Acc. is the accuracy. Acc.* is the accuracy from the GPT-based metric. $\cap$ means overlap with ChEF. ICL, Lang. Perf., and Instruct. Follow. are short for in-context learning, language performance, and instruction following, respectively.
  • Figure 2: Two examples of Recipes in ChEF. A Recipe consists of {Scenario, Instruction, Inferencer, Metric}. The Recipe of (a) is {Flickr30k, ICE, PPL, Accuracy}, while (b) is {VOC2012, Query, Multi-Turn, Accuracy}.
  • Figure 3: Recipes for evaluating six dimensions of desiderata. 1) All six dimensions are assessed on MMBench and ScienceQA, except for Hallucination, which is evaluated solely on MSCOCO; 2) All use standard query as Instruction, except ICL uses random ICE; 3) All employ Multi-Turn from CoT to PPL as Inferencer, except Hallucination with a single PPL; 4) The Metric for each dimension is specifically designed for the respective evaluation.
  • Figure 4: Examples of the desiderata. The distinguishing design of each desideratum is marked in red. For calibration evaluation, the prediction confidence is calculated to determine the gap between confidence and accuracy. Instruction following is evaluated through verbalizer manipulation. In-context learning is evaluated by providing ICE in the instruction. Robustness is assessed by introducing noise to both the image and text inputs. Language performance is evaluated by instructing the model to generate chain-of-thought content. Hallucination is evaluated solely on MSCOCO, by querying whether a specific object is present in the image.
  • Figure 5: Results of desiderata. The dashed line is the accuracy evaluated on MMBench. The score for each dimension is computed by normalizing the results from the specific metric to a range of 0-100. The calibration score is represented by 1-ECE. The instruction following score is the average MR across different verbalizer settings. The in-context learning score is the average RIAM across various shot numbers. The language performance score is normalized from the results of the GPT-based metric. The robustness score is normalized from RMM, and the hallucination score directly represents accuracy.
  • ...and 21 more figures
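
The Figure 5 caption states that the calibration score is represented by 1-ECE and reported on a 0-100 scale. The sketch below shows one common way to compute Expected Calibration Error and turn it into such a score. The equal-width binning, the default of 10 bins, and the function names are assumptions made for illustration; the paper's exact binning is not given in this summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (binning scheme is an assumption)."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than out of range.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def calibration_score(confidences, correct, n_bins=10):
    """Map 1 - ECE onto the 0-100 range used for the desiderata scores."""
    return 100.0 * (1.0 - expected_calibration_error(confidences, correct, n_bins))


# Toy data: the model is 90% confident on four answers and gets three right.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [True, True, True, False]
print(round(expected_calibration_error(confs, hits), 4))
print(round(calibration_score(confs, hits), 2))
```

A model whose confidence matches its accuracy in every bin has an ECE of 0 and therefore a score of 100 under this mapping.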