Languages are Modalities: Cross-Lingual Alignment via Encoder Injection

Rajan Agarwal, Aarush Gupta

TL;DR

LLINK addresses the cross-lingual gap for low-resource, non-Latin languages in instruction-tuned LLMs, a gap caused by tokenizer fragmentation and weak cross-lingual coupling. It introduces a two-stage encoder-injection approach: Stage A contrastively aligns sentence embeddings from a frozen multilingual encoder to the LLM latent space at a reserved decoder position, and Stage B expands this signal into $K=8$ soft slots with lightweight adapters and a usage-enforcement objective so that the decoder actually uses it. On Khmer-English tasks, LLINK yields large gains in bilingual retrieval (R@1 up to $0.450$ and MRR up to $0.660$) and is preferred by an LLM judge (81.3% vs base, 63.6% vs direct fine-tuning), while reducing decoder token usage by about $3\times$. This work demonstrates a practical, compute-efficient path to stronger cross-lingual alignment for low-resource languages without tokenizer changes or decoder retraining, though lexical fidelity and numeric precision remain areas for future improvement.
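The retrieval figures above are R@1 and MRR. As a reminder of how these two metrics are usually computed, here is a minimal sketch in plain Python. The function name `retrieval_metrics`, the toy similarity matrix, and the convention that query `i` has its correct match at candidate index `i` are all illustrative assumptions, not details taken from the paper's evaluation code.

```python
def retrieval_metrics(sim):
    """Return (R@1, MRR) for a square similarity matrix.

    Assumption: sim[i][j] scores query i against candidate j,
    and the correct candidate for query i is candidate i.
    """
    n = len(sim)
    hits_at_1 = 0
    reciprocal_rank_sum = 0.0
    for i, row in enumerate(sim):
        # Rank of the gold candidate: 1 + number of candidates scored strictly higher.
        rank = 1 + sum(1 for score in row if score > row[i])
        hits_at_1 += int(rank == 1)
        reciprocal_rank_sum += 1.0 / rank
    return hits_at_1 / n, reciprocal_rank_sum / n


# Toy 3x3 example (made-up scores, not data from the paper).
sim = [
    [0.9, 0.1, 0.2],  # gold ranked 1st
    [0.6, 0.5, 0.1],  # gold ranked 2nd
    [0.3, 0.8, 0.7],  # gold ranked 2nd
]
r_at_1, mrr = retrieval_metrics(sim)
print(f"R@1={r_at_1:.3f}  MRR={mrr:.3f}")
```

R@1 only rewards an exact top hit, while MRR gives partial credit for near misses, which is why MRR can sit well above R@1 on the same system.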

Abstract

Instruction-tuned Large Language Models (LLMs) underperform on low-resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute-efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that the improvements can be attributed to reduced tokenization inflation and stronger cross-lingual alignment, although the model retains residual weaknesses in numeric fidelity. Treating low-resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.
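To make the first step more concrete, the sketch below shows one common form of contrastive objective (InfoNCE with in-batch negatives) applied to projected encoder vectors and target decoder-space vectors. The abstract only says the projector is "contrastive"; the specific loss form, the temperature of 0.07, the cosine similarity, and the names `l2_normalize` and `info_nce` are assumptions made for illustration and may differ from what the authors use.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(projected, targets, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    projected[i]: encoder embedding after the (hypothetical) projector.
    targets[i]:   the matching decoder-space vector for the same sentence.
    Every other targets[j] in the batch serves as a negative.
    """
    p = [l2_normalize(v) for v in projected]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, p_i in enumerate(p):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(p_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # -log softmax probability of the positive pair
    return total / len(p)


# Toy 2-D vectors (illustrative only).
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.6f}   mismatched pairs: {shuffled:.6f}")
```

The loss is near zero when each projected vector is closest to its own target and grows when pairs are mismatched, which is the pressure that pulls the projector's output toward the decoder's embedding space.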

Paper Structure

This paper contains 21 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of the LLINK architecture: multilingual text passes through a projection model to match LLaMA's embedding space, then to the LLM, which produces an output using the translated tokens. Dotted lines are train-time only.
  • Figure 2: Tokenization of the same sentence with the LLaMA-3.2-1B tokenizer — English: 16 tokens (0.3 tok/char); Khmer translit: 35 (0.5); Khmer: 104 (1.7). Dividers on Khmer show duplicate tokens mapping to the same character.
  • Figure 3: Analysis of fine-tuned representations with Khmer LLaMA 3.2 tokenization. The top three charts present layer-wise similarities, hidden state norms and residual changes. The bottom three charts present input embedding norms, token NLL scores and cosine similarities between Khmer, Khmer Latin transliteration and English translations of the same text.
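
The token counts in the Figure 2 caption can be turned into a simple inflation ratio relative to English. The snippet below is a small arithmetic sketch using only the three counts stated in that caption; the helper name `inflation` and the choice of English as the baseline are illustrative assumptions.

```python
# Token counts reported in the Figure 2 caption (LLaMA-3.2-1B tokenizer).
token_counts = {"English": 16, "Khmer translit": 35, "Khmer": 104}


def inflation(counts, baseline="English"):
    """Express each token count as a multiple of the baseline count."""
    base = counts[baseline]
    return {name: n / base for name, n in counts.items()}


ratios = inflation(token_counts)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the English token count")
```

For this one sentence, native Khmer script needs 104 / 16 = 6.5 times as many tokens as English, and the Latin transliteration roughly 2.2 times as many.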