LIME: Less Is More for MLLM Evaluation

King Zhu, Qianbo Zang, Shian Jia, Siwei Wu, Feiteng Fang, Yizhi Li, Shawn Gavin, Tuney Zheng, Jiawei Guo, Bo Li, Haoning Wu, Xingwei Qu, Jian Yang, Zachary Liu, Xiang Yue, J. H. Liu, Chenghua Lin, Min Yang, Shiwen Ni, Wenhao Huang, Ge Zhang

TL;DR

This work tackles the inefficiency and noise in existing Multimodal LLM benchmarks by introducing LIME, a semi-automated data-curation pipeline that uses open-source judges, a screening process, and leakage elimination to produce a high-quality, compact benchmark (9,403 samples across 6 domains). LIME significantly reduces data volume and evaluation time while improving the ability to distinguish model capabilities; it also reveals that traditional captioning metrics like CIDEr are unreliable for assessing MLLMs and should be excluded from the overall score. The empirical results show LIME poses a greater challenge to MLLMs and emphasizes deeper image perception, with broad correlations to established benchmarks but improved discriminative power. The framework is extensible and aims to align evaluation more closely with real-world user queries through future updates and expansions.

Abstract

Multimodal Large Language Models (MLLMs) are evaluated on various benchmarks, such as image captioning, visual question answering, and reasoning. However, many of these benchmarks include overly simple or uninformative samples, complicating the effective distinction of different MLLMs' performance. Furthermore, evaluating models across numerous benchmarks incurs a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated through a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that necessitate image-based understanding. Our experiments indicate that LIME reduces the number of samples by 76% and evaluation time by 77%, while also providing a more effective means of distinguishing the capabilities of different models. Notably, we find that traditional automatic metrics, such as CIDEr, are inadequate for assessing MLLMs' captioning performance; excluding the caption task score yields a more accurate reflection of overall model performance. All code and data are available at https://github.com/kangreen0210/LIME.
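The filtering logic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` fields, the `curate` function, and the `easy_threshold` value are hypothetical names and settings chosen only for illustration. It assumes each sample already carries per-judge correctness results (answered with the image) and a flag recording whether a text-only model answered it correctly without the image.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    sample_id: str
    judge_correct: List[bool]  # one entry per judge model (image provided)
    text_only_correct: bool    # answered correctly without the image


def curate(samples: List[Sample],
           easy_threshold: float = 0.9) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (kept, needs_review).

    easy_threshold is an assumed cutoff, not a value taken from the paper.
    """
    kept, needs_review = [], []
    for s in samples:
        if not s.judge_correct:
            raise ValueError(f"sample {s.sample_id} has no judge results")
        # Answer leakage: the question is solvable without looking at the image.
        if s.text_only_correct:
            continue
        pass_rate = sum(s.judge_correct) / len(s.judge_correct)
        # Too easy: nearly every judge answers correctly, so the sample
        # does little to separate models.
        if pass_rate >= easy_threshold:
            continue
        # No judge answers correctly: possibly a flawed sample, so it is
        # set aside for manual screening instead of being kept automatically.
        if pass_rate == 0.0:
            needs_review.append(s)
        else:
            kept.append(s)
    return kept, needs_review


samples = [
    Sample("a", [True] * 10, False),                  # too easy
    Sample("b", [True, False, False, False], False),  # informative
    Sample("c", [False] * 4, False),                  # possible bad case
    Sample("d", [True, False], True),                 # answer leakage
]
kept, needs_review = curate(samples)
```

In this toy run only sample `b` is kept and sample `c` is routed to manual review, mirroring the "semi-automated" aspect of the pipeline in which questionable samples are checked by a person instead of being discarded outright.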

Paper Structure (40 sections, 2 equations, 27 figures, 7 tables)


Figures (27)

  • Figure 1: Pipeline of the data curation. The left half is the Open-Source Models as Judges module, which uses several multimodal LLMs to answer the question in each sample and assess its difficulty. The upper right is the Semi-Automated Screening Process module, which filters out samples that are too simple or too difficult. The Eliminating Answer Leakage module filters out samples that can be answered without the image.
  • Figure 2: Overall data statistics for the selected subtasks. Easy: questions that most models can answer correctly; Bad Case: questions that may contain errors; Remained: questions that remain after filtering.
  • Figure 3: The number of samples removed at each stage compared to the original data, including three stages of filtering and the final sampling stage.
  • Figure 4: Correlation distribution between LIME and WildVision Elo.
  • Figure 5: Distribution differences across parameter sizes within the same model series. Left ($\bigstar$): LLaVA-1.6 series; right ($\blacktriangle$): InternVL-2 series.
  • ...and 22 more figures