HEMM: Holistic Evaluation of Multimodal Foundation Models

Paul Pu Liang, Akshay Goindani, Talha Chafekar, Leena Mathur, Haofei Yu, Ruslan Salakhutdinov, Louis-Philippe Morency

TL;DR

HEMM addresses the need for holistic benchmarking of multimodal foundation models by introducing a three-dimensional taxonomy (basic skills, information flow, and real-world use cases) evaluated across 30 datasets. It analyzes how modeling choices such as scale, pretraining data, multimodal alignment, pretraining objectives, and instruction tuning shape performance, producing actionable insights. Key findings: larger models and instruction tuning yield better results, data diversity improves generalization, translation tasks remain challenging, and results vary across real-world domains. The framework is public and extensible, enabling ongoing community contributions to datasets, models, and metrics.

Abstract

Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across a set of 3 dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flows, and use cases) that pose challenges to today's models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training, and instruction tuning objectives) influence performance. Our conclusions regarding challenging multimodal interactions, use cases, and tasks requiring reasoning and external knowledge, the benefits of data and model scale, and the impacts of instruction tuning yield actionable insights for future work in multimodal foundation models.
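The three-dimensional taxonomy described above can be pictured as a set of tags attached to each dataset, with model scores averaged per tag. The following is a minimal sketch under stated assumptions, not the HEMM codebase: the class `DatasetTags`, the function `mean_score_by`, the dataset names `toy_a`/`toy_b`/`toy_c`, and all scores are hypothetical placeholders chosen for illustration. Only the category names (use cases and information flows) come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Category names taken from the abstract; everything else is hypothetical.
USE_CASES = {"multimedia", "affective computing", "natural sciences",
             "healthcare", "human-computer interaction"}
INFO_FLOWS = {"querying", "translation", "editing", "fusion"}


@dataclass(frozen=True)
class DatasetTags:
    """Hypothetical record tagging one dataset along the taxonomy."""
    name: str
    use_case: str
    info_flow: str
    needs_external_knowledge: bool  # one example of a basic-skill tag

    def __post_init__(self):
        # Reject tags that are not part of the taxonomy.
        if self.use_case not in USE_CASES:
            raise ValueError(f"unknown use case: {self.use_case}")
        if self.info_flow not in INFO_FLOWS:
            raise ValueError(f"unknown information flow: {self.info_flow}")


def mean_score_by(tags, scores, dimension):
    """Average per-dataset scores within each value of one taxonomy dimension."""
    groups = defaultdict(list)
    for tag in tags:
        groups[getattr(tag, dimension)].append(scores[tag.name])
    return {value: mean(vals) for value, vals in groups.items()}


# Placeholder datasets and scores (NOT results from the paper).
tags = [
    DatasetTags("toy_a", "multimedia", "querying", False),
    DatasetTags("toy_b", "natural sciences", "translation", True),
    DatasetTags("toy_c", "natural sciences", "querying", True),
]
scores = {"toy_a": 0.75, "toy_b": 0.25, "toy_c": 0.5}

by_use_case = mean_score_by(tags, scores, "use_case")
by_info_flow = mean_score_by(tags, scores, "info_flow")
```

Grouping by a different field (for example `needs_external_knowledge`) reuses the same function, which is the sense in which the three levels can be analyzed separately.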

Paper Structure

This paper contains 56 sections, 2 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: HEMM is an evaluation framework that characterizes multimodal models along several dimensions (size, architecture, pretraining objective, fine-tuning objective, training data) and emphasizes holistic benchmarking of these models at three disentangled levels: basic skills, information flow, and use cases.
  • Figure 2: Responses of GPT-4V and Gemini on samples from the science category. In (a), the models lack domain knowledge and cannot correctly translate images of molecules into SMILES notation. In (b), the models struggle with complex reasoning and fail to comprehend the relation between the force and the size of the magnets. In (c), all models except GPT-4V fail to capture fine-grained details and misclassify the image as an airport instead of a runway.
  • Figure 3: Average scores are higher for multimedia datasets than for other use cases, and lowest for healthcare, HCI, and science. The models struggle on iNaturalist, Decimer, Enrico, PathVQA, and MemeCap, which require external knowledge, fine-grained alignment, and complex reasoning.
  • Figure 4: Tasks requiring commonsense and compositional reasoning are challenging. In (a), GPT-4V and Gemini are unable to employ social commonsense to analyze the relationships between the two people. Example (b) demonstrates the models' difficulty in composing information from both modalities, leading to their failure to comprehend the scenario where a tree smashed into the car (not a car smashed into the tree). In (c), all models except GPT-4V fail to grasp the visual metaphors and the juxtaposition of the two scenarios.
  • Figure 5: On average, large models perform better than small and medium models (p-values < 0.001). Instruct-BLIP and BLIP-2 are outliers: despite having fewer parameters, they achieve relatively high performance, even close to GPT-4V and Gemini.
  • ...and 3 more figures