Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Rajan Agarwal, Aarush Gupta
TL;DR
LLINK addresses the performance gap that instruction-tuned LLMs show on low-resource, non-Latin-script languages, a gap caused by tokenizer fragmentation and weak cross-lingual coupling. It introduces a two-stage encoder-injection approach. Stage A contrastively aligns sentence embeddings from a frozen multilingual encoder to the LLM's latent space at a reserved decoder position. Stage B expands this signal into $K=8$ soft slots, trained with lightweight adapters and a usage-enforcement objective so that the decoder actually uses the injected signal. On Khmer-English tasks, LLINK yields large gains in bilingual retrieval (R@1 up to $0.450$ and MRR up to $0.660$) and is preferred by an LLM judge (81.3% vs. the base model, 63.6% vs. direct fine-tuning), while reducing decoder token usage by about $3\times$. This work demonstrates a practical, compute-efficient path to stronger cross-lingual alignment for low-resource languages without changing the tokenizer or retraining the decoder, though lexical fidelity and numeric precision remain areas for future improvement.
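For readers unfamiliar with the two retrieval metrics quoted above, the following is a minimal sketch of how R@1 and MRR are commonly computed from a query-by-candidate similarity matrix. It assumes that the correct match for query $i$ is candidate $i$; the function name `retrieval_metrics` and the toy scores are illustrative and are not taken from the paper's evaluation code.

```python
from typing import Dict, Sequence


def retrieval_metrics(similarity: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Compute R@1 and MRR, assuming query i's correct match is candidate i."""
    n = len(similarity)
    hits_at_1 = 0
    reciprocal_rank_sum = 0.0
    for i, row in enumerate(similarity):
        gold = row[i]
        # Rank is 1 plus the number of candidates scoring strictly higher than the gold one.
        rank = 1 + sum(1 for score in row if score > gold)
        if rank == 1:
            hits_at_1 += 1
        reciprocal_rank_sum += 1.0 / rank
    return {"recall_at_1": hits_at_1 / n, "mrr": reciprocal_rank_sum / n}


# Toy similarity scores (hypothetical): rows are queries, columns are candidates.
toy = [
    [0.9, 0.1, 0.2],  # gold candidate ranked 1st
    [0.8, 0.5, 0.3],  # gold candidate ranked 2nd
    [0.7, 0.6, 0.1],  # gold candidate ranked 3rd
]
metrics = retrieval_metrics(toy)
```

R@1 only rewards an exact top hit, while MRR gives partial credit for near misses, which is why the two numbers can differ substantially on the same system.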
Abstract
Instruction-tuned Large Language Models (LLMs) underperform on low-resource languages written in non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute-efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the aligned vector is expanded into $K$ soft slots and trained with minimal adapters so that the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that the improvements can be attributed to reduced tokenization inflation and stronger cross-lingual alignment, although the model retains residual weaknesses in numeric fidelity. Treating low-resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.
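The two stages described in the abstract can be illustrated with a rough, dependency-free sketch. This is not the authors' implementation. It assumes a single linear projector, a one-directional InfoNCE loss with a hypothetical temperature of 0.07, a linear slot expander, and random stand-ins for both the encoder embeddings and the decoder-side target vectors. All dimensions except $K=8$ are toy values, and the adapters and usage-enforcement objective are omitted.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weight: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def normalize(x: Vector) -> Vector:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def info_nce(projected: List[Vector], targets: List[Vector],
             temperature: float = 0.07) -> float:
    """One-directional InfoNCE: projected[i] should score highest against targets[i]."""
    p = [normalize(v) for v in projected]
    t = [normalize(v) for v in targets]
    loss = 0.0
    for i, query in enumerate(p):
        logits = [sum(a * b for a, b in zip(query, cand)) / temperature for cand in t]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        loss += log_sum_exp - logits[i]
    return loss / len(p)


def expand_to_slots(weight: Matrix, z: Vector, num_slots: int) -> List[Vector]:
    """Map one decoder-space vector to num_slots soft-slot vectors of the same width."""
    flat = matvec(weight, z)  # length num_slots * len(z)
    d = len(z)
    return [flat[k * d:(k + 1) * d] for k in range(num_slots)]


rng = random.Random(0)
ENC_DIM, DEC_DIM, K = 6, 4, 8  # toy sizes; only K = 8 comes from the text

# Hypothetical trainable weights (randomly initialised here, never updated).
projector = [[rng.gauss(0, 0.1) for _ in range(ENC_DIM)] for _ in range(DEC_DIM)]
expander = [[rng.gauss(0, 0.1) for _ in range(DEC_DIM)] for _ in range(K * DEC_DIM)]

# Random stand-ins for frozen-encoder sentence embeddings and decoder-side targets.
encoder_embeddings = [[rng.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(3)]
decoder_targets = [[rng.gauss(0, 1) for _ in range(DEC_DIM)] for _ in range(3)]

# Stage A: project into the decoder space and score with the contrastive loss.
projected = [matvec(projector, e) for e in encoder_embeddings]
stage_a_loss = info_nce(projected, decoder_targets)

# Stage B: expand one projected vector into K soft slots for the decoder input.
slots = expand_to_slots(expander, projected[0], K)
```

In a real system the soft slots would be placed into the decoder's input embedding sequence in place of the tokenized source text, which is consistent with the reduction in decoder token usage that the paper reports.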
