Mapping Transformer Leveraged Embeddings for Cross-Lingual Document Representation
Tsegaye Misikir Tashu, Eduard-Raul Kontos, Matthia Sabatelli, Matias Valdenegro-Toro
TL;DR
The paper tackles cross-lingual document recommendation by proposing Transformer Leveraged Document Representations (TLDRs) mapped into a cross-lingual space to compare documents across languages. It combines four pretrained multilingual transformers (mBERT, mT5, XLM-RoBERTa, ErnieM) with three mapping strategies (LCA, LCC, NCA) to produce cross-lingual document embeddings evaluated on 20 EU language pairs. Results show that mapping-based cross-lingual representations significantly outperform non-mapped baselines, with the Linear Concept Approximation (LCA) method often delivering the strongest performance, particularly for mBERT. The work demonstrates a computationally efficient approach to cross-lingual information retrieval that does not rely on extensive fine-tuning, and suggests future work on low-resource languages and language tuples to broaden applicability.
Abstract
Document recommendation systems have become tools for finding relevant content on the Web. However, these systems struggle to recommend documents written in a language other than the query language, so they may overlook relevant resources in other languages. This research focuses on representing documents across languages using Transformer Leveraged Document Representations (TLDRs) that are mapped into a cross-lingual domain. Four multilingual pre-trained transformer models (mBERT, mT5, XLM-RoBERTa, ErnieM) were evaluated with three mapping methods across 20 language pairs formed from five selected languages of the European Union. Mate Retrieval Rate and Reciprocal Rank were used to measure the effectiveness of mapped TLDRs compared with non-mapped ones. The results highlight the power of cross-lingual representations obtained through pre-trained transformers and mapping approaches, suggesting a promising direction for extending beyond connections between two specific languages.
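To make the two evaluation metrics concrete, the following is a minimal sketch of how Mate Retrieval Rate and mean Reciprocal Rank are commonly computed over parallel documents. It assumes the usual definitions (the "mate" of a source document is its translation at the same index, and candidates are ranked by cosine similarity); the function names, the tie handling, and the toy vectors are illustrative assumptions, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mate_ranks(src_vecs, tgt_vecs):
    """Rank of each source document's mate among all target documents.

    Assumes src_vecs[i] and tgt_vecs[i] embed the same document in two
    languages. Rank 1 means the mate is the most similar target.
    Ties are resolved in favour of the mate (an assumption of this sketch).
    """
    ranks = []
    for i, query in enumerate(src_vecs):
        sims = [cosine(query, doc) for doc in tgt_vecs]
        mate_sim = sims[i]
        # Count the targets that score strictly higher than the mate.
        ranks.append(1 + sum(1 for s in sims if s > mate_sim))
    return ranks


def mate_retrieval_rate(ranks):
    """Fraction of queries whose mate is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the mate over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical toy embeddings standing in for (mapped) TLDRs.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.6, 0.8, 0.0], [0.2, 0.3, 0.8], [0.1, 0.3, 0.9]]

ranks = mate_ranks(src, tgt)
print(ranks)
print(mate_retrieval_rate(ranks), mean_reciprocal_rank(ranks))
```

In this toy setup the second source document's mate is outranked by one other target, so both metrics fall below 1. Under these definitions, comparing mapped and non-mapped TLDRs amounts to running the same computation on each set of embeddings.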
