IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios

Hai Lin, Shaoxiong Zhan, Junyou Su, Haitao Zheng, Hui Wang

TL;DR

This paper introduces the IRSC benchmark for evaluating embedding models in multilingual RAG tasks, along with two new metrics: the Similarity of Semantic Comprehension Index (SSCI) and the Retrieval Capability Contest Index (RCCI).

Abstract

In Retrieval-Augmented Generation (RAG) tasks using Large Language Models (LLMs), the quality of retrieved information is critical to the final output. This paper introduces the IRSC benchmark for evaluating the performance of embedding models in multilingual RAG tasks. The benchmark encompasses five retrieval tasks: query retrieval, title retrieval, part-of-paragraph retrieval, keyword retrieval, and summary retrieval. Our research addresses the current lack of comprehensive testing and effective comparison methods for embedding models in RAG scenarios. We introduce two new metrics, the Similarity of Semantic Comprehension Index (SSCI) and the Retrieval Capability Contest Index (RCCI), and evaluate models such as Snowflake-Arctic, BGE, GTE, and M3E. Our contributions include: 1) the IRSC benchmark, 2) the SSCI and RCCI metrics, and 3) insights into the cross-lingual limitations of embedding models. The IRSC benchmark aims to enhance the understanding and development of accurate retrieval systems in RAG tasks. All code and datasets are available at: https://github.com/Jasaxion/IRSC_Benchmark

Paper Structure (18 sections, 2 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The IRSC Benchmark is structured around five primary task types, each designed to evaluate different aspects of a model's retrieval capabilities. The red labels indicate the languages and quantities of each dataset.
  • Figure 2: Radar charts comparing the S-Arctic, BGE, GTE, M3E, and MiniLM model series on the IRSC benchmark's Query, Title, Part, Keyword, and Summary tasks in the mixed-language setting. Metric: the average of Recall@10, MRR@10, and nDCG@10.
  • Figure 3: Comparative SSCI Heatmaps of the S-Arctic Series, BGE Series, GTE Series, M3E Series, and MiniLM Series in the IRSC Benchmark's Summary Subtask Across Chinese and English. Smaller values indicate more consistent model performance.
  • Figure 4: Comparative SSCI Heatmaps of the S-Arctic Series, BGE Series, GTE Series, M3E Series, and MiniLM Series in the IRSC Benchmark's Query, Title, Part and Keyword Subtasks in English. Smaller values indicate more consistent model performance.
  • Figure 5: Comparison of RCCI results between S-Arctic-S and M3E-Small in the mixed-language, Chinese, and English settings.
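
The Figure 2 caption reports an average of Recall@10, MRR@10, and nDCG@10. As a rough illustration of what those three numbers measure, here is a minimal Python sketch using the standard textbook definitions with binary relevance. It is not taken from the IRSC repository: the function names, the toy document IDs, and the choice to average the three scores with equal weight for a single query are all assumptions made for this example.

```python
import math


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    # The ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def average_score(ranked, relevant, k=10):
    """Equal-weight mean of the three metrics (an assumption for this sketch)."""
    return (
        recall_at_k(ranked, relevant, k)
        + mrr_at_k(ranked, relevant, k)
        + ndcg_at_k(ranked, relevant, k)
    ) / 3.0


# Toy query: the single relevant document "d1" is retrieved at rank 2.
ranked_ids = ["d3", "d1", "d7"]   # hypothetical ranking, assumed to hold unique IDs
relevant_ids = {"d1"}             # hypothetical ground truth

print(recall_at_k(ranked_ids, relevant_ids))
print(mrr_at_k(ranked_ids, relevant_ids))
print(ndcg_at_k(ranked_ids, relevant_ids))
print(average_score(ranked_ids, relevant_ids))
```

In this toy case the relevant document is found (recall 1.0) but only at rank 2, so MRR and nDCG are both penalized. A benchmark would normally compute each metric per query and then average over all queries in a task.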