LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

Hieu Man; Nghia Trung Ngo; Viet Dac Lai; Ryan A. Rossi; Franck Dernoncourt; Thien Huu Nguyen

TL;DR

LUSIFER tackles the English-centric bias of LLM-based embedding models by aligning a language-universal multilingual encoder with an English-centric, embedding-oriented LLM through a lightweight connector. Training proceeds in two stages: alignment on English data with a masked reconstruction and autoregressive objective, followed by contrastive representation finetuning with bidirectional attention and LoRA for parameter efficiency. The result is a multilingual embedding model trained without multilingual supervision. Evaluated on a benchmark of 5 tasks, 123 datasets, and 14 languages (with cross-lingual coverage of over 100 languages), LUSIFER achieves a mean score of $62.63$, outperforming prior English-centric baselines, with substantial gains for medium- and low-resource languages (e.g., Telugu, +$22.15$) and on cross-lingual retrieval (average $57.89$, +$5.75$). The results indicate robust language-agnostic representations; future work includes extending to other modalities and integrating with other state-of-the-art models.
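To make the contrastive finetuning stage concrete, the sketch below computes an InfoNCE-style loss with in-batch negatives in plain Python. It is only an illustration of the general technique: the cosine similarity, the temperature value of 0.05, the use of in-batch negatives, and the function names `cosine` and `info_nce` are assumptions, not details given in the summary above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    positives[i] is the matching text for queries[i]; every other entry
    of `positives` serves as an in-batch negative (an assumption here).
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings: correctly paired batch vs. a batch with swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query is closest to its own positive and grows when the pairs are mismatched, which is the signal that pulls matching texts together in embedding space.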

Abstract

Recent advances in large language model (LLM)-based embedding models have established new state-of-the-art results on text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are integrated through a minimal set of trainable parameters that act as a connector, transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances multilingual performance across various embedding tasks, particularly for medium- and low-resource languages, without requiring explicit multilingual training data.
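The data flow described in the abstract (multilingual encoder, then connector, then embedding model) can be sketched as follows. This is a toy illustration under stated assumptions: the abstract says only that the connector is a minimal set of trainable parameters, so the single linear layer, the mean pooling step, and the names `connector` and `mean_pool` are hypothetical stand-ins rather than the authors' design.

```python
def connector(hidden_states, weight, bias):
    """Project encoder token states into the LLM's input space.

    hidden_states: n_tokens x d_enc (multilingual encoder outputs)
    weight:        d_llm x d_enc   (hypothetical trainable projection)
    bias:          d_llm
    """
    return [
        [sum(h * w for h, w in zip(token, row)) + b for row, b in zip(weight, bias)]
        for token in hidden_states
    ]


def mean_pool(token_vectors):
    """Average token vectors into one fixed-size embedding (assumed pooling)."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


# Toy example: 2 tokens, encoder width 2, LLM width 3.
encoder_states = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]

projected = connector(encoder_states, weight, bias)
embedding = mean_pool(projected)
```

In a real system the projected states would pass through the LLM before pooling; that step is omitted here so the example stays self-contained.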

Paper Structure (18 sections, 6 figures, 19 tables)

Introduction
Related Work
English-centric Embedding Models
Zero-shot Multilingual Embedding
Multilingual Embedding Benchmarks
Methodology
Model Architecture
Training Pipeline
Experiment
Benchmark
Experimental Setup
Main Results
Cross-Lingual Evaluation
Task-Specific Performance
Ablation Study
...and 3 more sections

Figures (6)

Figure 1: Overview of LUSIFER. Left: Align a multilingual encoder with the target English-centric LLM using only English data and a minimal set of trainable parameters. Center: End-to-end representation finetuning through contrastive learning on English text-embedding tasks using LoRA. Right: During inference, LUSIFER processes text-embedding tasks across multiple languages.
Figure 2: Overview of tasks and datasets in our benchmark. Cross-lingual datasets are marked with a blue shade.
Figure 3: Performance comparison of LUSIFER and baseline models on Classification and Clustering tasks.
Figure 4: Performance comparison of LUSIFER and baseline models on Reranking tasks.
Figure 5: Performance comparison of LUSIFER and baseline models on Retrieval and STS tasks.
...and 1 more figure
