
Automatic benchmarking of large multimodal models via iterative experiment programming

Alessandro Conti, Enrico Fini, Paolo Rota, Yiming Wang, Massimiliano Mancini, Elisa Ricci

TL;DR

APEx tackles the tedious process of benchmarking large multimodal models by introducing an automated, LLM-driven framework that designs benchmarks, runs experiments, and compiles reports in an iterative loop. It leverages a modular tool ecosystem (image generation, retrieval, and transformations) to generate data and evaluate multiple LMMs, terminating when the aggregated evidence suffices to answer user queries. The approach demonstrates reproducibility of prior findings on data-type transformations, supports analyses at multiple granularity levels, and remains extensible to new tools and models. This framework promises dramatically reduced evaluation effort and improved transparency in comparing multimodal capabilities across diverse scenarios.

Abstract

Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the user in natural language. Thanks to its modularity, our framework is flexible and extensible as new tools become available. Empirically, APEx reproduces the findings of existing studies while allowing for arbitrary analyses and hypothesis testing.
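The iterative procedure described in the abstract can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Report`, `run_investigation`, `stub_propose`, and `stub_execute` are hypothetical, and the LLM-based decisions (which experiment to run next, whether the evidence suffices) are replaced by fixed stubs so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Report:
    """Running record of the investigation (hypothetical structure)."""
    query: str
    models: List[str]
    findings: List[str] = field(default_factory=list)


def run_investigation(
    query: str,
    models: List[str],
    propose: Callable[[Report], Optional[str]],
    execute: Callable[[str, List[str]], Dict[str, float]],
    max_rounds: int = 5,
) -> Report:
    """Repeat propose -> execute -> record until `propose` returns None."""
    report = Report(query=query, models=list(models))
    for _ in range(max_rounds):
        # In APEx an LLM makes this decision; here it is a plain callable.
        experiment = propose(report)
        if experiment is None:  # evidence judged sufficient, stop the loop
            break
        results = execute(experiment, report.models)
        report.findings.append(f"{experiment}: {results}")
    return report


def stub_propose(report: Report) -> Optional[str]:
    """Assumption: a fixed two-experiment plan stands in for LLM reasoning."""
    planned = ["identify blur", "identify rotation"]
    done = len(report.findings)
    return planned[done] if done < len(planned) else None


def stub_execute(experiment: str, models: List[str]) -> Dict[str, float]:
    """Assumption: placeholder scores stand in for a real benchmark run."""
    return {model: 0.5 for model in models}


if __name__ == "__main__":
    final = run_investigation(
        "Can the models identify data types?",
        ["model_a", "model_b"],
        stub_propose,
        stub_execute,
    )
    for line in final.findings:
        print(line)
```

Passing `propose` and `execute` as callables mirrors the modularity the abstract emphasizes: either one can be swapped without touching the loop. The `max_rounds` cap is an added safeguard for the sketch and is not taken from the paper.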

Paper Structure (17 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: APEx. Our automatic benchmarking tool has four components: an $\operatorname{orchestrator}$ for reasoning, an $\operatorname{engine}$ for function execution, a benchmark $\operatorname{generator}$ containing image selection and manipulation tools, and a $\operatorname{library}$ of LMMs. Given a user $\operatorname{query}$, the $\operatorname{orchestrator}$ instantiates a $\operatorname{report}$ containing the query and the LMMs to be tested. The $\operatorname{orchestrator}$ then receives the $\operatorname{report}$ and specifies a first $\operatorname{experiment}$ to be executed. The corresponding benchmark is generated and executed by the $\operatorname{engine}$; the $\operatorname{results}$ are collected, discussed by the $\operatorname{orchestrator}$, and added to the $\operatorname{report}$. The $\operatorname{orchestrator}$ repeats the experimentation loop until the results are deemed sufficient to answer the query, at which point it summarizes the report and returns its findings.
  • Figure 1: Data type group ranking. Ranking of data type group recognition performance from best (top) to worst (bottom). APEx achieves the same ranking as udandarao2023visual both when testing each data type independently and aggregating metrics by group (Avg. types), and when directly querying for group understanding (Groups).
  • Figure 2: Data types identification. Summary of the normalized results (vertical axis) achieved by the models across the 27 data type identification tasks, in comparison to those obtained in udandarao2023visual. The values equal the min-max normalized performance across the set of experiments designed by APEx.
  • Figure 3: Data classes recognition. Accuracy averaged over experiments of APEx for BLIP-2, IDEFICS, and LLaVA on the nine recognition tasks.
  • Figure 4: Data classes robustness to data types. Summary of the average accuracy achieved by BLIP-2, IDEFICS, and LLaVA across the nine recognition tasks when adding different data type transformations. Accuracy is averaged across the set of experiments designed by APEx.
  • ...and 1 more figure
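
The Figure 2 caption above refers to min-max normalized performance. A minimal sketch of that rescaling follows, assuming the standard definition (x - min) / (max - min). The function name `min_max_normalize` and the sample scores are hypothetical, and the exact set of values the paper normalizes over is not specified here.

```python
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: when all scores are equal, return zeros to avoid 0/0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


if __name__ == "__main__":
    # Hypothetical accuracies from three experiments.
    print(min_max_normalize([0.25, 0.5, 0.75]))
```

Normalizing this way makes results from different experiments comparable on a shared 0 to 1 axis, at the cost of discarding the absolute accuracy values.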