How Lexical is Bilingual Lexicon Induction?
Harsh Kohli, Helian Feng, Nicholas Dronen, Calvin McCarter, Sina Moeini, Ali Kebarighotbi
TL;DR
This work addresses bilingual lexicon induction under hubness by augmenting a retrieve-and-rank framework with lexical cues. It introduces Lexical-Feature Boosted BLI (LFBB), which adds term-frequency and part-of-speech features to a listwise learning-to-rank model, combining a fastText-based retriever and a cross-encoder reranker. On XLING, LFBB improves over prior baselines, with notable gains from POS and frequency features across many language pairs, indicating the practical value of lexical signals in cross-lingual lexicon induction. The approach is simple and scalable, though limited by linear modeling assumptions and data quality in XLING, suggesting avenues for more expressive modeling and broader evaluation.
Abstract
In contemporary machine learning approaches to bilingual lexicon induction (BLI), a model learns a mapping between the embedding spaces of a language pair. Recently, the retrieve-and-rank approach to BLI has achieved state-of-the-art results on the task. However, the problem remains challenging in low-resource settings due to the paucity of data, and it is further complicated by factors such as lexical variation across languages. We argue that incorporating additional lexical information into the recent retrieve-and-rank approach should improve lexicon induction. We demonstrate the efficacy of our proposed approach on XLING, improving over the previous state of the art by an average of 2% across all language pairs.
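To make the retrieve-and-rank idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes toy two-dimensional vectors that are already mapped into a shared space, hypothetical corpus frequencies and part-of-speech tags, and hand-picked weights for a linear scoring function. In the actual approach the weights would be learned by a listwise ranking model, and the similarity signal would come from a fastText-based retriever and a cross-encoder reranker. All names below (`retrieve`, `rerank`, the toy dictionaries) are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(src_vec, tgt_vecs, k):
    """Stage 1: return the k target words closest to the source vector."""
    scored = [(word, cosine(src_vec, vec)) for word, vec in tgt_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def rerank(src, candidates, freq_src, freq_tgt, pos_src, pos_tgt, weights):
    """Stage 2: rescore candidates with hypothetical lexical features.

    The score is linear in three features: embedding similarity, the gap
    between log term frequencies (penalized), and a POS-match indicator.
    """
    w_sim, w_freq, w_pos = weights
    rescored = []
    for word, sim in candidates:
        freq_gap = abs(math.log(freq_src[src]) - math.log(freq_tgt[word]))
        pos_match = 1.0 if pos_src[src] == pos_tgt[word] else 0.0
        score = w_sim * sim - w_freq * freq_gap + w_pos * pos_match
        rescored.append((word, score))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored

# Toy data (assumed for illustration only): English "dog" against three
# German candidates in a shared embedding space.
src_word = "dog"
src_vec = [1.0, 0.0]
tgt_vecs = {"Hund": [0.8, 0.6], "bellen": [1.0, 0.1], "Haus": [0.0, 1.0]}
freq_src = {"dog": 1000}
freq_tgt = {"Hund": 900, "bellen": 50, "Haus": 2000}
pos_src = {"dog": "NOUN"}
pos_tgt = {"Hund": "NOUN", "bellen": "VERB", "Haus": "NOUN"}

candidates = retrieve(src_vec, tgt_vecs, k=2)
ranked = rerank(src_word, candidates, freq_src, freq_tgt,
                pos_src, pos_tgt, weights=(1.0, 0.1, 0.5))
```

In this toy setup, embedding similarity alone places the verb "bellen" ahead of "Hund", while the frequency and POS features move "Hund" to the top of the reranked list. This is the kind of correction that lexical signals are argued to provide.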
