MIEB: Massive Image Embedding Benchmark

Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff

TL;DR

The paper introduces MIEB, a universal benchmark for image and image-text embeddings that covers 130 tasks across 38 languages and eight capability categories, enabling broad evaluation beyond traditional retrieval and classification. It analyzes 50 models from vision-only, CLIP, and MLLM-based families, showing that no single method dominates all task types and revealing strengths and limitations across categories such as Visual STS, OCR-based document understanding, and interleaved embeddings. A key finding is the strong correlation between vision-encoder performance on MIEB and downstream MLLM performance, suggesting MIEB as a practical proxy for selecting encoders for multimodal models. The work also introduces MIEB-lite for efficient benchmarking and shows that model scale, data quality, and training recipes all influence results, with implications for pursuing universal embedding models. Public code, data, and leaderboards are provided to support ongoing benchmarking and progress in universal image-text representations.

Abstract

Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We evaluate 50 models on MIEB, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models, such as their accurate visual representation of texts, as well as their still-limited capabilities in interleaved encodings and in matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
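
The correlation claim above can be made concrete with a small sketch. The abstract does not say which correlation coefficient is used, so Pearson's r is an assumption here, and the encoder names and scores are hypothetical values made up for illustration, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores, invented for illustration only (NOT from the paper).
# Keys are placeholder vision-encoder names.
embedding_scores = {"encoder_a": 61.0, "encoder_b": 68.5, "encoder_c": 72.0, "encoder_d": 80.5}
mllm_scores = {"encoder_a": 40.2, "encoder_b": 45.1, "encoder_c": 44.0, "encoder_d": 52.3}

# Align the two score lists by encoder name before correlating.
names = sorted(embedding_scores)
r = pearson([embedding_scores[n] for n in names], [mllm_scores[n] for n in names])
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would mean that ranking encoders by their embedding-benchmark score roughly predicts their ranking when used inside an MLLM, which is the sense in which the benchmark could serve as a proxy for encoder selection.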

Paper Structure

This paper contains 59 sections, 12 figures, 23 tables.

Figures (12)

  • Figure 1: Overview of MIEB task categories with examples. See \ref{tab:MIEB big tasks} for details about capabilities measured and other information.
  • Figure 2: UMAP Visualization of ImageNet Dog15. Each class corresponds to one dog breed. CLIP clusters are more distinct.
  • Figure 3: Linear probing performance across different shots k. We select representative models from our vision-only and CLIP categories (\ref{sec:models}). See \ref{subsec: k-shot} for details on fine-grained and coarse-grained tasks.
  • Figure 4: Correlations between performance on generative MLLM benchmarks from \cite{tong2024cambrian} (y-axis) and our Visual STS (x-axis). High correlation means that our Visual STS tasks can predict generative performance.
  • Figure 5: T2I Retrieval example from the MSCOCOT2IRetrieval task.
  • ...and 7 more figures