Cross-lingual paraphrase identification

Inessa Fedorova, Aleksei Musatow

TL;DR

This paper tackles cross-lingual paraphrase identification by training a multilingual bi-encoder with contrastive learning to rival cross-encoder performance. It introduces a modified Additive Margin Softmax objective with in-batch and hard-negative mining, leveraging mega-batching and a strengthened loss to improve semantic alignment. Evaluations on PAWS-X show the bi-encoder achieves performance close to state-of-the-art cross-encoders, with a modest 7–10% relative drop, while offering substantial benefits in embedding quality and offline applicability. The work demonstrates effective multilingual semantic similarity modeling suitable for scalable semantic search and cross-lingual applications.
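
To make the objective named above more concrete, here is a minimal NumPy sketch of an additive margin softmax loss with in-batch negatives. It shows the commonly used bidirectional form, not the paper's modified variant, whose details are not given in this summary. The function names and the `margin` and `scale` values are illustrative assumptions, and hard-negative mining and mega-batching are left out.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def additive_margin_softmax_loss(src, tgt, margin=0.3, scale=20.0):
    """One-directional (src -> tgt) additive margin softmax with in-batch negatives.

    Row i of `src` and row i of `tgt` are assumed to be a positive pair;
    every other row of `tgt` serves as a negative for src[i].
    `margin` and `scale` are illustrative values, not the paper's settings.
    """
    src = l2_normalize(src)
    tgt = l2_normalize(tgt)
    sim = src @ tgt.T  # (B, B) matrix of cosine similarities
    # Subtract the margin from the positive pairs only (the diagonal).
    logits = scale * (sim - margin * np.eye(sim.shape[0]))
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal entry as the correct class.
    return float(-np.mean(np.diag(log_probs)))


def bidirectional_loss(src, tgt, margin=0.3, scale=20.0):
    """Sum of the src -> tgt and tgt -> src losses."""
    return (additive_margin_softmax_loss(src, tgt, margin, scale)
            + additive_margin_softmax_loss(tgt, src, margin, scale))


# Toy check with random stand-ins for sentence embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 16))
tgt_aligned = src + 0.01 * rng.normal(size=(8, 16))  # near-duplicates of src
tgt_shuffled = tgt_aligned[::-1]                     # pairs deliberately mismatched

loss_aligned = bidirectional_loss(src, tgt_aligned)
loss_shuffled = bidirectional_loss(src, tgt_shuffled)
```

In this toy setup the aligned batch yields a lower loss than the mismatched one. Raising `margin` increases the loss for a fixed batch, which pushes the encoder to separate positives from in-batch negatives more strongly.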

Abstract

The paraphrase identification task involves measuring semantic similarity between two short sentences. The task is difficult in a single language, and multilingual paraphrase identification is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. This approach allows us to use model-produced embeddings for various tasks, such as semantic search. We evaluate our model on downstream tasks and also assess embedding space quality. Our performance is comparable to state-of-the-art cross-encoders, with a relative drop of only 7–10% on the chosen dataset, while maintaining good embedding quality.

Paper Structure (15 sections, 4 equations, 1 figure, 2 tables, 2 algorithms)