Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples

Andrianos Michail, Simon Clematide, Rico Sennrich

TL;DR

The paper addresses the challenge of evaluating cross-lingual semantic search models on target language pairs and domains by introducing Cross-Lingual Semantic Discrimination (CLSD), a task that uses LLM-generated adversarial distractors to test whether the true parallel sentence can be correctly identified. It builds four German–French CLSD datasets in the news domain and analyzes direct cross-lingual retrieval versus English-pivot retrieval across multiple multilingual encoders, complemented by a linguistically informed perturbation study. The contributions include the CLSD task, the dataset release under AGPL-3.0, a pivot-versus-direct evaluation, and a fine-grained analysis linking linguistic perturbations to embedding behavior, offering practical guidance for region- and language-specific semantic search. Overall, CLSD provides a scalable, domain-sensitive framework for evaluating and stress-testing multilingual embeddings beyond standard benchmarks, highlighting model-dependent trade-offs between direct cross-lingual and pivot-based retrieval.

Abstract

The evaluation of cross-lingual semantic search models is often limited to existing datasets from tasks such as information retrieval and semantic textual similarity. We introduce Cross-Lingual Semantic Discrimination (CLSD), a lightweight evaluation task that requires only parallel sentences and a Large Language Model (LLM) to generate adversarial distractors. CLSD measures an embedding model's ability to rank the true parallel sentence above semantically misleading but lexically similar alternatives. As a case study, we construct CLSD datasets for German–French in the news domain. Our experiments show that models fine-tuned for retrieval tasks benefit from pivoting through English, whereas bitext mining models perform best in direct cross-lingual settings. A fine-grained similarity analysis further reveals that embedding models differ in their sensitivity to linguistic perturbations. We release our code and datasets under AGPL-3.0: https://github.com/impresso/cross_lingual_semantic_discrimination
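
The ranking idea behind CLSD can be illustrated with a minimal sketch. It assumes that candidates are scored by cosine similarity and that a query counts as correct when the true parallel sentence receives the highest score; the function names, the convention that row 0 holds the true parallel sentence, and the toy vectors are all hypothetical and are not taken from the released code. In practice the vectors would come from a multilingual embedding model.

```python
import numpy as np


def cosine_similarity(query, candidates):
    """Cosine similarity between one query vector and each candidate row."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query


def clsd_top1_accuracy(queries, candidate_sets):
    """Fraction of queries whose true parallel sentence is ranked first.

    Assumption for this sketch: in every candidate set, row 0 is the
    embedding of the true parallel sentence and the remaining rows are
    embeddings of the LLM-generated distractors.
    """
    hits = 0
    for query, candidates in zip(queries, candidate_sets):
        scores = cosine_similarity(query, candidates)
        hits += int(np.argmax(scores) == 0)
    return hits / len(queries)


# Toy embeddings standing in for real model output (hypothetical values).
queries = np.array([
    [1.0, 0.0, 0.0],   # e.g. a German source sentence
    [0.0, 1.0, 0.0],
])
candidate_sets = np.array([
    # Query 0: the true parallel sentence (row 0) is the closest candidate.
    [[0.9, 0.1, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]],
    # Query 1: a distractor (row 1) is closer than the true parallel sentence.
    [[0.6, 0.8, 0.0], [0.1, 0.99, 0.0], [1.0, 0.0, 0.0]],
])

accuracy = clsd_top1_accuracy(queries, candidate_sets)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Under the same assumptions, an English-pivot variant would embed an English translation of the query instead of the original sentence and reuse the same scoring function.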

Paper Structure

This paper contains 18 sections, 1 equation, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Prompt 0: Distractor generation prompt for GPT-4
  • Figure 2: Change in cross-lingual cosine similarity between original and distractor sentence pairs with exactly one token swapped, grouped by the part of speech of the differing token.
  • Figure 3: Monolingual cosine similarity change in original-distractor sentence pairs with exactly one token swapped, grouped by the part of speech of the differing token.
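
The kind of analysis described in the captions of Figures 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes sentences are already tokenized and part-of-speech tagged, that the "change" is the similarity to the distractor minus the similarity to the original sentence relative to a fixed anchor embedding, and that all field names (`tokens`, `pos_tags`, `anchor_vec`, and so on) are hypothetical.

```python
from collections import defaultdict

import numpy as np


def single_swap_index(original_tokens, distractor_tokens):
    """Index of the only differing token, or None if the two token
    sequences do not differ in exactly one position."""
    if len(original_tokens) != len(distractor_tokens):
        return None
    diffs = [
        i
        for i, (a, b) in enumerate(zip(original_tokens, distractor_tokens))
        if a != b
    ]
    return diffs[0] if len(diffs) == 1 else None


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def similarity_change_by_pos(examples):
    """Mean change in cosine similarity, grouped by the POS tag of the
    swapped token. Pairs without exactly one swapped token are skipped."""
    changes = defaultdict(list)
    for ex in examples:
        idx = single_swap_index(ex["tokens"], ex["distractor_tokens"])
        if idx is None:
            continue
        # Assumed sign convention: negative values mean the distractor is
        # less similar to the anchor than the original sentence is.
        delta = cosine(ex["anchor_vec"], ex["distractor_vec"]) - cosine(
            ex["anchor_vec"], ex["original_vec"]
        )
        changes[ex["pos_tags"][idx]].append(delta)
    return {pos: float(np.mean(values)) for pos, values in changes.items()}


# Toy data with hypothetical tags and embeddings.
examples = [
    {   # exactly one token differs (a verb), so this pair is counted
        "tokens": ["Der", "Minister", "kam"],
        "distractor_tokens": ["Der", "Minister", "ging"],
        "pos_tags": ["DET", "NOUN", "VERB"],
        "anchor_vec": np.array([1.0, 0.0]),
        "original_vec": np.array([1.0, 0.0]),
        "distractor_vec": np.array([0.6, 0.8]),
    },
    {   # two tokens differ, so this pair is skipped
        "tokens": ["Der", "Minister", "kam"],
        "distractor_tokens": ["Die", "Ministerin", "kam"],
        "pos_tags": ["DET", "NOUN", "VERB"],
        "anchor_vec": np.array([1.0, 0.0]),
        "original_vec": np.array([1.0, 0.0]),
        "distractor_vec": np.array([0.0, 1.0]),
    },
]

result = similarity_change_by_pos(examples)
print(result)
```

For a cross-lingual reading, the anchor would be the embedding of the parallel sentence in the other language; how the monolingual variant is anchored is not specified in the caption, so that choice is left open here.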