ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain

Ali Shiraee Kasmaee, Mohammad Khodadad, Mohammad Arshi Saloot, Nicholas Sherck, Stephen Dokas, Hamidreza Mahyar, Soheila Samiee

TL;DR

ChemTEB targets the gap in chemistry-specific embedding evaluation by introducing a domain-focused benchmark with tasks spanning Classification, Pair Classification, Clustering, Retrieval, and Bitext Mining, using data from PubChem, chemistry-focused Wikipedia, CoconutDB, BeIR, and SDS. It evaluates 34 models (27 open-source, 7 proprietary) and ranks them via per-task averages and Reciprocal Rank Fusion with k=10, revealing that no single model dominates across all tasks and that domain adaptation yields limited gains outside niche tasks. Proprietary models often outperform open-source ones, while modern contrastive learning and architectures provide the strongest gains; SMILES-based bitext mining remains particularly challenging for general-domain embeddings. The open-source, extensible design of ChemTEB supports reproducible evaluation and targeted development of chemistry-aware embeddings for applications in literature mining, synthesis planning, and regulatory analyses.
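The Reciprocal Rank Fusion step mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard RRF score, the sum over rankings of 1 / (k + rank), with k=10 as stated in the summary. The function name `reciprocal_rank_fusion`, the model names, and the toy per-task rankings are hypothetical, and how ChemTEB groups tasks before fusing is not specified here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=10):
    """Fuse several rankings into one list of (model, score) pairs.

    rankings: list of lists; each inner list holds model names ordered
              from best to worst for one task (hypothetical input shape).
    k:        smoothing constant; 10 matches the value quoted in the summary.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best model in a task contributes 1 / (k + 1).
        for rank, model in enumerate(ranking, start=1):
            scores[model] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy example: three tasks, three hypothetical models.
per_task_rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]

fused = reciprocal_rank_fusion(per_task_rankings, k=10)
for model, score in fused:
    print(f"{model}: {score:.4f}")
```

Because RRF depends only on ranks and not on raw scores, it can combine tasks whose metrics sit on different scales, which is presumably why it is used alongside per-task averages.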

Abstract

Recent advancements in language models have started a new era of superior information retrieval and content generation, with embedding models playing an important role in optimizing data representation efficiency and performance. While benchmarks like the Massive Text Embedding Benchmark (MTEB) have standardized the evaluation of general domain embedding models, a gap remains in specialized fields such as chemistry, which require tailored approaches due to domain-specific challenges. This paper introduces a novel benchmark, the Chemical Text Embedding Benchmark (ChemTEB), designed specifically for the chemical sciences. ChemTEB addresses the unique linguistic and semantic complexities of chemical literature and data, offering a comprehensive suite of tasks on chemical domain data. Through the evaluation of 34 open-source and proprietary models using this benchmark, we illuminate the strengths and weaknesses of current methodologies in processing and understanding chemical information. Our work aims to equip the research community with a standardized, domain-specific evaluation framework, promoting the development of more precise and efficient NLP models for chemistry-related applications. Furthermore, it provides insights into the performance of generic models in a domain-specific context. ChemTEB comes with open-source code and data, contributing further to its accessibility and utility.

Paper Structure

This paper contains 25 sections, 1 equation, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Distribution plots for five categories of tasks. The KDE plots show the probability density functions, where the x-axis represents the range of predicted values (performance distribution over tasks of each category and models of each family) and the y-axis represents the estimated density. Each colored line corresponds to a unique model family, enabling a clear visual comparison of their value distributions.
  • Figure 2: Summary of evaluated models in terms of efficiency. Open-source models are drawn as circles, with circle size proportional to the number of parameters, and proprietary models are drawn as stars. Marker color reflects the embedding dimension. The x-axis shows the average inference speed (embedded samples/sec) over seven pair classification tasks (tasks 29-35 in the datasets summary table), measured on a V100 GPU machine.
  • Figure 3: Comparison of model performance on ChemTEB and MTEB benchmarks across different tasks. Each point represents a model from the intersection of those tested and those on the MTEB leaderboard as of the date. The figure highlights variations in task difficulty and domain specificity.
  • Figure S1: Comparison of model performance on ChemTEB and MTEB benchmarks across different tasks. Each point represents a model from the intersection of those tested and those on the MTEB leaderboard as of the date. The figure highlights variations in task difficulty and domain specificity.
  • Figure S2: Correlation Matrix across datasets. Each row and column represents a separate dataset tested in the ChemTEB benchmark. The values and associated color reflect the correlation between the performance of different models on each pair of these datasets.
  • ...and 1 more figure