FaMTEB: Massive Text Embedding Benchmark in Persian Language

Erfan Zinvandi, Morteza Alikhani, Mehran Sarmadi, Zahra Pourbahman, Sepehr Arvin, Reza Kazemi, Arash Amini

TL;DR

FaMTEB addresses the lack of a comprehensive Persian text-embedding benchmark by extending MTEB with 63 datasets across 7 tasks and introducing the novel summary retrieval task. It assembles 39 new Persian datasets via web collection, translation, and LLM generation, and includes datasets for chatbot and RAG scenarios to reflect modern applications. The paper evaluates 15 Persian or multilingual embedding models, identifies Jina as the top overall performer, and highlights task-specific strengths of Persian-centric models. It also provides an open-source benchmark with datasets, code, and a public leaderboard, enabling robust evaluation and development of Persian NLP for retrieval-augmented systems and conversational AI.

Abstract

In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven different tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets are a combination of existing, translated, and newly generated data, offering a diverse evaluation framework for Persian language models. Given the increasing use of text embedding models in chatbots, evaluation datasets are becoming an essential ingredient of chatbot challenges and Retrieval-Augmented Generation systems. As a contribution, we include chatbot evaluation datasets in the MTEB benchmark for the first time. In addition, we introduce the new task of summary retrieval, which is not among the tasks included in standard MTEB. Another contribution of this paper is the introduction of a substantial number of new Persian-language NLP datasets suitable for training and evaluation, some of which have no previous counterparts in Persian. We evaluate the performance of several Persian and multilingual embedding models across a range of tasks. This work introduces an open-source benchmark with datasets, code, and a public leaderboard.
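
To make the summary retrieval task concrete, the following is a minimal sketch of how such a task could be scored with an embedding model. It is an illustration only, not the benchmark's implementation. It assumes that each summary is paired with the document at the same index, that the summary acts as the query, and that cosine similarity with top-1 accuracy is the scoring rule. The function names and the toy vectors are hypothetical and stand in for real model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_accuracy(summary_vecs, document_vecs):
    """Fraction of summaries whose most similar document is the paired one.

    Assumption: summary i belongs to document i.
    """
    hits = 0
    for i, summary in enumerate(summary_vecs):
        scores = [cosine(summary, doc) for doc in document_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(summary_vecs)


# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
summaries = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.0]]
documents = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 1.0]]

accuracy = top1_accuracy(summaries, documents)
print(f"top-1 accuracy: {accuracy:.3f}")
```

In this toy data the third summary is closer to the first document than to its own, so it counts as a miss. In practice the vectors would come from the embedding model under evaluation, and the benchmark may report a different metric than top-1 accuracy.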

Paper Structure

This paper contains 35 sections, 1 equation, 13 figures, 8 tables.

Figures (13)

  • Figure 1: An overview of the FaMTEB evaluation dataset.
  • Figure 2: The prompts used to generate the Synthetic Persian Chatbot Conversational Sentiment Analysis dataset. The prompt takes the chat subject, user tone, chatbot tone, and user emotion as input, corresponding to the placeholders "input subject", "input user tone", "input chatbot tone", and "emotion dictionary", respectively.
  • Figure 3: Panel (a) shows the distribution of the Query to Query data by similarity score. Panel (b) shows the distribution of the Query to Query data by assigned label, ranging from 0 to 2.
  • Figure 4: The prompts used to generate three datasets: a) Synthetic Persian QA, b) Synthetic Persian Keywords/Tone, and c) Synthetic Persian STS. The prompts are in Persian, with a translation below each prompt. In the prompts, the placeholder "input text" is replaced with the source text from which the data is generated. Prompt b additionally has an "input tone" placeholder to specify the desired tone.
  • Figure 5: The prompts used to generate two datasets: a) Synthetic Persian Chatbot and b) Synthetic Persian Chatbot RAG. Both prompts take the chat subject, user tone, and chatbot tone as input, corresponding to the placeholders "input subject", "input user tone", and "input chatbot tone", respectively. In addition, prompt a takes the user's satisfaction level through the placeholder "input satisfaction level", and prompt b takes the number of messages in the chat history through the placeholder "input number of history messages".
  • ...and 8 more figures