MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Jinjie Ni; Yifan Song; Deepanway Ghosal; Bo Li; David Junhao Zhang; Xiang Yue; Fuzhao Xue; Zian Zheng; Kaichen Zhang; Mahir Shah; Kabir Jain; Yang You; Michael Shieh

TL;DR

This work introduces MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities, and proposes multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions.

Abstract

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.
