How Lexical is Bilingual Lexicon Induction?

Harsh Kohli, Helian Feng, Nicholas Dronen, Calvin McCarter, Sina Moeini, Ali Kebarighotbi

TL;DR

This work addresses bilingual lexicon induction under hubness by augmenting a retrieve-and-rank framework with lexical cues. It introduces Lexical-Feature Boosted BLI (LFBB), which adds term-frequency and part-of-speech features to a listwise learning-to-rank model, combining a fastText-based retriever and a cross-encoder reranker. On XLING, LFBB improves over prior baselines, with notable gains from POS and frequency features across many language pairs, indicating the practical value of lexical signals in cross-lingual lexicon induction. The approach is simple and scalable, though limited by linear modeling assumptions and data quality in XLING, suggesting avenues for more expressive modeling and broader evaluation.

Abstract

In contemporary machine learning approaches to bilingual lexicon induction (BLI), a model learns a mapping between the embedding spaces of a language pair. Recently, the retrieve-and-rank approach to BLI has achieved state-of-the-art results on the task. However, the problem remains challenging in low-resource settings due to the paucity of data, and it is further complicated by factors such as lexical variation across languages. We argue that incorporating additional lexical information into the retrieve-and-rank approach should improve lexicon induction. We demonstrate the efficacy of our proposed approach on XLING, improving over the previous state of the art by an average of 2% across all language pairs.

Paper Structure (11 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Spearman's rank correlation of term frequencies derived from Common Crawl and Wikipedia. Each heatmap shows the Spearman's rank correlation of term frequencies for every source (row) and target (column) language pair in the 5k vocabularies of the XLING corpus, with frequencies derived from Common Crawl and Wikipedia. The correlation is computed and plotted separately for each part of speech, one per sub-figure. Cells containing a 0 have an insufficient number (<10) of source-language terms for that part of speech.
  • Figure 2: Principal Component Analysis (PCA) of the source word, the target word, and the Nearest Neighbours (NN) of the source word in the embedding space. The source word, target word, and nearest neighbours are distinguished by the shape and color of the points (as shown in the legend). In the left panel, all points have the same size and transparency. In the right panel, the size of each dot is scaled by the likelihood of a matching POS between source and target, and its alpha (transparency) is set by the normalised frequency difference of the source-candidate pair.
  • Figure 3: Mean absolute difference of term frequency. The figure plots the mean absolute difference of term frequency between the source and target word for the ground truth, LFBB (ours), and XLM-R.
  • Figure 4: Nearest Neighbours of "motions" with (left) and without (right) lexical features.
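
The two statistics named in the captions of Figures 1 and 3 can be illustrated with a minimal sketch. It assumes term frequencies are already available as per-word numbers; the toy vocabulary, the frequency values, and the function names `average_ranks`, `spearman`, and `mean_abs_freq_diff` are hypothetical and are not taken from the paper or from XLING.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_abs_freq_diff(pairs, src_freq, tgt_freq):
    """Mean absolute difference of term frequency over (source, target) pairs."""
    return mean(abs(src_freq[s] - tgt_freq[t]) for s, t in pairs)


# Hypothetical normalised frequencies for a toy English-German dictionary.
src_freq = {"dog": 0.8, "house": 0.6, "motion": 0.2, "lexicon": 0.1}
tgt_freq = {"Hund": 0.7, "Haus": 0.9, "Bewegung": 0.3, "Lexikon": 0.1}
pairs = [("dog", "Hund"), ("house", "Haus"),
         ("motion", "Bewegung"), ("lexicon", "Lexikon")]

rho = spearman([src_freq[s] for s, _ in pairs],
               [tgt_freq[t] for _, t in pairs])
mad = mean_abs_freq_diff(pairs, src_freq, tgt_freq)
print(f"Spearman rho = {rho:.3f}, mean |freq diff| = {mad:.3f}")
```

A high rank correlation together with a small mean absolute difference on the ground-truth pairs would suggest that frequency is a usable signal for ranking translation candidates, which is the kind of evidence the figures are described as presenting.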