Revisiting Multi-Modal LLM Evaluation

Jian Lu; Shikhar Srivastava; Junyu Chen; Robik Shrestha; Manoj Acharya; Kushal Kafle; Christopher Kanan

TL;DR

This work argues that the benchmarks most often used to evaluate MLLMs are inadequate and introduces slim, zero-shot evaluation sets for VQA and grounding tasks, integrated into the LAVIS framework. It evaluates a range of models, from open-weight LLaVA variants to the closed-weight GPT-4V and GPT-4o, on VQDv1, TDIUC, TallyQA, and DVQA, and reports previously unreported weaknesses in visual grounding, counting, and chart understanding. Many models struggle with multi-object grounding, complex counting, and OCR-reliant chart tasks, and prompt design significantly affects performance. The ready-to-use, modular evaluation datasets and tooling are intended to make benchmarking more reliable and to speed up progress in multi-modal grounding and reasoning research.

Abstract

With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions that satisfy a given query. Our experiments reveal the weaknesses of many MLLMs that have not previously been reported. Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs. Project webpage: https://kevinlujian.github.io/MLLM_Evaluations/
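
VQDv1 asks for every region that satisfies a query, so a per-query score has to compare a set of predicted boxes against a set of ground-truth boxes. As a minimal sketch, assuming axis-aligned (x1, y1, x2, y2) boxes, greedy one-to-one matching, and an IoU threshold of 0.5 (all illustrative choices, not necessarily the metric used in the paper), precision and recall for a single query could be computed as below. The function names `iou` and `precision_recall` are hypothetical and are not part of LAVIS.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred: List[Box], gt: List[Box],
                     thresh: float = 0.5) -> Tuple[float, float]:
    """Greedy one-to-one matching of predictions to ground truth.

    Assumed convention: a query with no ground-truth boxes and no
    predictions scores (1.0, 1.0); any other empty case scores 0.0.
    """
    if not pred and not gt:
        return 1.0, 1.0
    unmatched = list(gt)
    true_pos = 0
    for p in pred:
        # Pick the best still-unmatched ground-truth box for this prediction.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thresh:
            unmatched.remove(best)  # each ground-truth box matches once
            true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gt) if gt else 0.0
    return precision, recall
```

Under this sketch, a model that returns a single correct box for a query with two ground-truth regions gets full precision but only half recall, which is the kind of gap a multi-box query exposes and a single-box referring expression task does not.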

Paper Structure (36 sections, 1 equation, 12 figures, 14 tables)

Introduction
This paper makes the following contributions:
Multi-modal Large Language Models
Creating "Slim" Evaluation Sets
Experiments
Visual Query Detection with VQDv1
VQDv1 Metrics.
Results for VQDv1.
Fine-Grained VQA Assessment with TDIUC
TDIUC Metrics.
Results for TDIUC.
Assessing Counting Ability with TallyQA
TallyQA Metrics.
Results for TallyQA.
Assessing Chart Comprehension with DVQA
...and 21 more sections

Figures (12)

Figure 1: While only one object needs to be detected in popular referring expression comprehension datasets, VQDv1 requires identifying all regions that satisfy a query.
Figure 2: Recall and precision curves for queries with varying box counts.
Figure 3: TDIUC has 12 kinds of questions, enabling fine-grained analysis of MLLMs.
Figure 4: Examples of simple and complex counting questions in TallyQA.
Figure 5: An example from DVQA.
...and 7 more figures
