Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages

Imalsha Puranegedara; Themira Chathumina; Nisal Ranathunga; Nisansa de Silva; Surangika Ranathunga; Mokanarangan Thayaparan

TL;DR

The paper addresses the multilingual gap in English-centric LLMs by fusing all intermediate layers of a frozen multilingual encoder into an English-focused LLM, enabling zero-shot multilingual inference with English-only training. It introduces two fusion strategies, Global Softmax (one weight per layer) and Transformer Softmax (token-specific weights), then maps the fused representations into the LLM embedding space with a learned projection; only the fusion components are trained, using a prefix language modeling objective. Empirically, the approach improves over the LangBridge baseline on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, with the Transformer Softmax variant achieving the strongest gains (e.g., Sinhala accuracy rising from $71.66\%$ to $75.86\%$ and average XNLI accuracy from $70.36\%$ to $71.50\%$). These results demonstrate the value of multi-layer, language-agnostic signals for cross-lingual transfer in a data-efficient, zero-shot setting, offering a scalable path toward more capable and equitable multilingual LLMs. The work also notes that larger multilingual LLMs and script-diverse languages are promising avenues for future exploration.

Abstract

Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM's embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
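
The two weighting schemes can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the `layer_logits` parameter, and the linear `scorer` are hypothetical. In particular, the dot-product scorer is only a stand-in for the learned model that produces token-specific weights in the Transformer Softmax variant, and `project` stands in for the learned mapping into the LLM's embedding space.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_softmax_fusion(layers, layer_logits):
    """Fuse layers with one softmax weight per layer, shared by all tokens.

    layers: list of L layer outputs, each a list of T token vectors (length D).
    layer_logits: L learnable scalars (hypothetical parameter name).
    """
    weights = softmax(layer_logits)
    num_tokens, dim = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, layers)) for d in range(dim)]
        for t in range(num_tokens)
    ]

def token_wise_fusion(layers, scorer):
    """Fuse layers with a separate softmax over layers for each token.

    scorer: a length-D vector. A dot product with it scores each layer's
    state for a token; this is an assumed stand-in for the learned model.
    """
    num_tokens, dim = len(layers[0]), len(layers[0][0])
    fused = []
    for t in range(num_tokens):
        scores = [sum(h * s for h, s in zip(layer[t], scorer)) for layer in layers]
        weights = softmax(scores)
        fused.append(
            [sum(w * layer[t][d] for w, layer in zip(weights, layers)) for d in range(dim)]
        )
    return fused

def project(fused, matrix):
    """Map fused D-dim vectors into a D_llm-dim space (matrix is D x D_llm)."""
    return [
        [sum(vec[d] * matrix[d][k] for d in range(len(vec))) for k in range(len(matrix[0]))]
        for vec in fused
    ]
```

With two layers and equal logits, `global_softmax_fusion` reduces to a plain average of the layers, while `token_wise_fusion` can weight the layers differently for each token. In both cases only the fusion parameters and the projection would be trained; the encoder outputs in `layers` are treated as fixed inputs.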
