MIEB: Massive Image Embedding Benchmark
Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff
TL;DR
The paper introduces MIEB, a broad benchmark for image and image-text embeddings that covers 130 tasks across 38 languages and eight capability categories, enabling evaluation well beyond traditional retrieval and classification. It evaluates 50 models from the vision-only, CLIP, and MLLM-based families and finds that no single method dominates all task types, with strengths and limitations varying across categories such as Visual STS, OCR-based document understanding, and interleaved embeddings. A key finding is that vision-encoder performance on MIEB correlates strongly with downstream MLLM performance, which suggests MIEB can serve as a practical proxy when selecting encoders for multimodal models. The work also introduces MIEB-lite for efficient benchmarking and shows that model scale, data quality, and training recipe all affect results, with implications for building universal embedding models. Code, data, and leaderboards are publicly available to support continued benchmarking of universal image-text representations.
Abstract
Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models on MIEB and find that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models, such as their accurate visual representation of text, as well as their still-limited capabilities in interleaved encodings and in matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
