HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks

Adnan El Assadi, Isaac Chung, Roman Solomatin, Niklas Muennighoff, Kenneth Enevoldsen

TL;DR

Embedding benchmarks currently lack human baselines, hindering interpretation of scores. HUME provides a generalizable, human-centric evaluation framework for 16 MTEB datasets across reranking, classification, clustering, and STS, with multi-annotator assessments and an LLM-annotator study. Findings show humans average 77.6% performance, slightly below the best embedding models, with large cross-task and cross-language variation influenced by dataset quality; LLMs approach some tasks but lag on high-agreement reranking and cannot fully replace human judgments. The work offers concrete guidance for designing more reliable benchmarks, such as prioritizing high-agreement tasks, reporting dataset quality, and adopting hybrid human–LLM annotation pipelines, with code and leaderboards available publicly.

Abstract

Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity (STS) in linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, though with substantial variation: models reach high performance on some datasets while struggling on others, notably those in low-resource languages. Our human annotations also reveal multiple dataset issues. We additionally benchmark nine LLMs as annotators on reranking, classification, and STS tasks, finding that they fall short of human performance (76.1% vs. 81.2%) despite offering scalability advantages. We provide human performance baselines, insights into task difficulty patterns, and an extensible evaluation framework that enables a more meaningful interpretation of results and informs the development of both models and benchmarks. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.

Paper Structure

This paper contains 51 sections, 16 figures, 13 tables.

Figures (16)

  • Figure 1: Human performance versus 13 embedding models across 16 tasks. Humans rank 4th (77.6), showing competitive but not dominant performance. Darker shades indicate larger models.
  • Figure 2: Comprehensive view of human performance relative to all model performance ranges across 16 tasks by language.
  • Figure 3: Emotion Classification annotation interface showing the 6-category emotion labeling task. This task achieved fair inter-annotator agreement ($\kappa=0.39$) due to ambiguous emotional states and mixed emotions in social media text. Human performance: 45.8%, Best model: 87.1%.
  • Figure 4: Tweet Sentiment Classification annotation interface demonstrating sentiment polarity annotation. This task achieved moderate inter-annotator agreement ($\kappa=0.48$) with reasonable consensus on positive/negative sentiment. Human performance: 84.4%, Best model: 90.9%.
  • Figure 5: ArXiv Clustering annotation interface showing academic papers that caused complete annotator disagreement ($\text{ARI}=-0.001$) due to interdisciplinary research overlap. Papers could be categorized by methodology, application domain, or research community, leading to fundamental disagreement. Human performance: 49.2%, Best model: 84.6%.
  • ...and 11 more figures
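
The figure captions above report inter-annotator agreement as $\kappa$ values. As a minimal sketch of how such a chance-corrected agreement score can be computed, the following assumes Cohen's kappa between two annotators; the paper may use a different variant (for example, one suited to more than two annotators), and the function name `cohen_kappa` and the example labels are hypothetical, not taken from the HUME code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative helper only; not part of the HUME or MTEB codebase.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each annotator's
    # marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical emotion labels from two annotators on four texts.
annotator_1 = ["joy", "joy", "anger", "sadness"]
annotator_2 = ["joy", "anger", "anger", "sadness"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 3 of 4 items (observed agreement 0.75), while chance agreement from the marginals is 5/16, so kappa is lower than the raw agreement rate. This is why a task with seemingly reasonable raw overlap can still receive only a fair or moderate $\kappa$.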