MINERS: Multilingual Language Models as Semantic Retrievers

Genta Indra Winata, Ruochen Zhang, David Ifeoluwa Adelani

TL;DR

This paper introduces MINERS, a benchmark designed to evaluate the ability of multilingual LMs in semantic retrieval tasks, including bitext mining and classification via retrieval-augmented contexts, and demonstrates that solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches, without requiring any fine-tuning.

Abstract

Words have been represented in a high-dimensional vector space that encodes their semantic similarities, enabling downstream applications such as retrieving synonyms, antonyms, and relevant contexts. However, despite recent advances in multilingual language models (LMs), the effectiveness of these models' representations in semantic retrieval contexts has not been comprehensively explored. To fill this gap, this paper introduces MINERS, a benchmark designed to evaluate the ability of multilingual LMs in semantic retrieval tasks, including bitext mining and classification via retrieval-augmented contexts. We create a comprehensive framework to assess the robustness of LMs in retrieving samples across more than 200 diverse languages, including extremely low-resource languages in challenging cross-lingual and code-switching settings. Our results demonstrate that solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches, without requiring any fine-tuning.

Paper Structure (47 sections, 1 equation, 7 figures, 18 tables)

Figures (7)

  • Figure 1: MINERS Benchmark tasks. In this example, we compare English (en) and Indonesian (id) texts across three tasks: (a) bitext retrieval, (b) retrieval-based classification, and (c) ICL classification. Light blue cubes represent vector representations of samples from the training dataset $\mathcal{D}_{train}$, generated by $\mathcal{M}$, while green, yellow, and red cubes denote raw text labels. The few-shot samples $f_i$ in task (c) are retrieved in the same manner as in task (b). The English translations of the text in the figure are as follows: "Saya suka kucing" ("I like cats"), "Saya suka anjing" ("I like dogs"), "Saya benci anjing" ("I hate dogs"), and "Kucing imut" ("Cute cats").
  • Figure 2: Results for $k \in \{1, 5, 10\}$ on bitext retrieval in (a) cross-lingual and (b) code-switching settings, and on retrieval-based classification in (c) monolingual, (d) cross-lingual, and (e) code-switching settings.
  • Figure 3: t-SNE representation of 200 randomly selected training samples from the NusaX dataset. The colors in the panels indicate the sample ID for (a) and (b), the language for (c) and (d), and the class for (e) and (f).
  • Figure 4: ICL performance dynamics of BLOOMZ models on the NusaX dataset using context retrieved from various percentiles with E5$_\text{LARGE}$. Lower percentiles correspond to more semantically relevant samples.
  • Figure 5: t-SNE representation of 200 random samples from the NusaX dataset. The colors in the panels indicate the sample ID for (a) and (b), the language for (c) and (d), and the class for (e) and (f).
  • ...and 2 more figures
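
The retrieval-based classification task described in the Figure 1 caption can be sketched as a nearest-neighbor lookup over embeddings. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors stand in for embeddings produced by a model $\mathcal{M}$, the sentiment labels are hypothetical, and the similarity measure (cosine) and the majority vote over the top $k$ neighbors are assumptions chosen for illustration. The function names `retrieve_top_k` and `classify_by_retrieval` are likewise invented for this sketch.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec, train_vecs, k):
    """Return indices of the k training vectors most similar to the query."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def classify_by_retrieval(query_vec, train_vecs, train_labels, k=1):
    """Predict a label by majority vote over the k nearest training samples.

    Assumption for this sketch: ties go to the label of the most similar
    neighbor among the tied labels.
    """
    neighbor_labels = [train_labels[i] for i in retrieve_top_k(query_vec, train_vecs, k)]
    votes = Counter(neighbor_labels)
    best = max(votes.values())
    for label in neighbor_labels:  # ordered from most to least similar
        if votes[label] == best:
            return label


# Hypothetical toy embeddings and labels (not taken from the paper).
train_vecs = [
    [0.9, 0.1, 0.0],  # e.g. "Saya suka kucing"
    [0.8, 0.2, 0.1],  # e.g. "Saya suka anjing"
    [0.1, 0.9, 0.2],  # e.g. "Saya benci anjing"
]
train_labels = ["positive", "positive", "negative"]

query_vec = [0.85, 0.15, 0.05]  # hypothetical embedding of an unseen sentence
prediction = classify_by_retrieval(query_vec, train_vecs, train_labels, k=3)
print(prediction)
```

No parameters are updated in this sketch: the prediction depends only on which stored samples lie closest to the query, which matches the fine-tuning-free setting the abstract describes. For the ICL task, the same retrieved indices could instead be used to select few-shot examples for a prompt.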