MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh

TL;DR

This work introduces MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities, and proposes multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions.

Abstract

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient to obtain. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.
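To make the ranking-correlation claim concrete, the following is a minimal sketch of how agreement between two model rankings could be measured. The abstract does not say which correlation statistic is used; Spearman rank correlation is assumed here purely for illustration, and the function names and scores are hypothetical, not taken from the paper.

```python
from typing import Sequence


def ranks(scores: Sequence[float]) -> list[float]:
    """Return 1-based ranks (highest score gets rank 1); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))


# Hypothetical per-model scores, listed in the same model order in both lists.
benchmark_scores = [80.1, 75.3, 60.2, 55.0]
crowd_scores = [1250.0, 1260.0, 1100.0, 1050.0]
print(round(spearman(benchmark_scores, crowd_scores), 3))
```

A value near 1 means the two evaluations order the models almost identically; in this toy data the top two models swap places, so the correlation falls below 1.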

Paper Structure

This paper contains 26 sections, 46 figures, 2 tables.

Figures (46)

  • Figure 1: MixEval-X encompasses eight input-output modality combinations and can be further extended. Its data points reflect real-world task distributions. The last grid presents the scores of frontier organizations' flagship models on MixEval-X, normalized to a 0-100 scale, with MMG tasks using win rates instead of Elo. Section \ref{sec:case_study} presents example data samples and model responses.
  • Figure 2: The overall pipeline for creating MixEval-X.
  • Figure 3: The evaluation results of prominent models on MixEval-X Image2Text, Image2Text-Hard, and their subsets. Proprietary models are highlighted in blue. See Section \ref{sec:eval_models} for details.
  • Figure 4: The evaluation results of prominent models on MixEval-X Video2Text, Video2Text-Hard, and their subsets. Proprietary models are highlighted in blue. See Section \ref{sec:eval_models} for details.
  • Figure 5: The results of prominent models on MixEval-X Audio2Text, Audio2Text-Hard, and their subsets. Proprietary models are highlighted in blue. See Section \ref{sec:eval_models} for details.
  • ...and 41 more figures