Eureka: Evaluating and Understanding Large Foundation Models

Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi

TL;DR

This work introduces Eureka, an open-source framework and benchmark suite for rigorous, transparent evaluation of large foundation models across diverse language and multimodal capabilities. By emphasizing modular evaluation pipelines, disaggregated analyses, non-determinism, and backward compatibility, it reveals that no single model dominates: different models excel in different areas, while several fundamental abilities, such as detailed image grounding, factuality, and stable outputs, remain challenging. The findings highlight the importance of open methodologies, reproducibility, and targeted improvements to drive robust, real-world AI systems. The framework and benchmarks aim to guide researchers and practitioners toward more nuanced model development and safer, more reliable deployment.

Abstract

Rigorous and reproducible evaluation is critical for assessing the state of the art and for guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice for several reasons, including benchmark saturation, lack of transparency in the methods used for measurement, development challenges in extracting measurements for generative tasks, and, more generally, the extensive number of capabilities required for a well-rounded comparison across models. We make three contributions to alleviate these challenges. First, we present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. Second, we introduce Eureka-Bench as an extensible collection of benchmarks testing capabilities that (i) are still challenging for state-of-the-art models and (ii) represent fundamental but overlooked language and multimodal capabilities. The inherent space for improvement in non-saturated benchmarks enables us to discover meaningful differences between models at a capability level. Third, using Eureka, we conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison, which can be leveraged to plan targeted improvements. In contrast to recent trends in reports and leaderboards showing absolute rankings and claims for one model or another to be the best, our analysis shows that there is no such best model. Different models have different strengths, but there are models that appear more often than others as best performers for some capabilities. Despite the recent improvements, current models still struggle with several fundamental capabilities, including detailed image understanding, benefiting from multimodal input when available rather than fully relying on language, factuality and grounding for information retrieval, and over-refusals.

Paper Structure

This paper contains 27 sections, 35 figures, and 12 tables.

Figures (35)

  • Figure 1: Performance of the best and worst models for multimodal (left) and language (right) datasets in Eureka-Bench. The red frontier shows the performance of the worst model, indicating the area that is already solved for the set of capabilities. The green frontier shows the performance of the best model, indicating the best known result with current technology. The blue horizon between the best model and the maximum performance shows the room for improvement for mastering the capability. The best-performance sets indicated in the green border include all models that perform within 2% of the best observed result.
  • Figure 2: Overview of experiment pipelines for two example evaluation experiments: Toxigen Generative (a) and GeoMeter (b). Components are configurable at instantiation time to maximize code reuse and enable controlled experimentation. For example, the PromptProcessing component is shown here to use different data readers or prompt templates in different contexts.
  • Figure 3: Samples from the GeoMeter dataset. Here, each sample is shown with random query attributes including color - numeric label and color - shape label.
  • Figure 4: Sample image-text pair from the GeoMeter dataset. Here the image contains 5 shapes labeled with random numeric labels, which are used as query attributes in the prompt. The prompt template shows the basic template used for every image-text pair in the benchmark, and the prompt example is the actual prompt for this image. Either an MCQ or a True/False question is appended to the prompt example.
  • Figure 5: The MMMU dataset is a set of visual question-answering tasks comprising 11.5K college-level problems across six broad disciplines and 30 subject areas; it requires detailed image understanding and reasoning grounded in deep subject knowledge.
  • ...and 30 more figures
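
The frontiers described in the Figure 1 caption can be illustrated with a minimal sketch. This is not code from the Eureka framework: the function name `frontiers`, the model names, and the scores below are hypothetical, and the sketch assumes that "within 2%" means two absolute percentage points on a 0 to 100 scale, which the caption does not specify.

```python
from typing import Dict, List, Tuple


def frontiers(scores: Dict[str, float],
              tolerance: float = 2.0) -> Tuple[float, float, List[str]]:
    """Summarize one capability across models.

    scores maps a model name to its score on a 0-100 scale.
    Returns (worst, best, best_set), where best_set lists every model
    whose score is within `tolerance` points of the best score.
    ASSUMPTION: the tolerance is absolute percentage points.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    worst = min(scores.values())  # red frontier: already solved by all models
    best = max(scores.values())   # green frontier: best known result
    best_set = sorted(name for name, s in scores.items()
                      if best - s <= tolerance)
    return worst, best, best_set


# Hypothetical scores, invented for illustration only.
example = {"model_a": 71.0, "model_b": 69.5, "model_c": 40.0}
worst, best, best_set = frontiers(example)
headroom = 100.0 - best  # blue horizon: remaining room for improvement
print(worst, best, best_set, headroom)
```

Under this reading, two models can share the best-performance set for a capability even when their scores differ slightly, which is consistent with the abstract's point that no single model is best across capabilities.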