MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder-LLM Integration in Cross-Lingual Reasoning

Kosei Uemura, David Guzmán, Quang Phuoc Nguyen, Jesujoba Oluwadara Alabi, En-shiun Annie Lee, David Ifeoluwa Adelani

TL;DR

MERLIN introduces a two-stage curriculum alignment framework that fuses a multilingual encoder with a frozen LLM to enhance cross-lingual reasoning in low-resource languages. In Stage I, a lightweight connector is trained via a three-part curriculum—General Mapping, Question Alignment, and Task-aware Augmentation—to project encoder outputs into the LLM’s embedding space without updating the LLM. Stage II then applies DoRA-based, parameter-efficient fine-tuning inside the decoder, freezing the encoder and the LLM backbone while adapting a small set of low-rank weights. Across MGSM, MSVAMP, AfriMGSM, and AfriXNLI, MERLIN achieves state-of-the-art results and substantial gains over strong baselines, particularly in low-resource languages, while maintaining competitive performance in high-resource languages. The results highlight the importance of cross-lingual embedding alignment and mid-layer decoder adaptations for reliable multilingual reasoning, enabling efficient deployment with modest computational budgets. Limitations include reliance on machine-translated data and a limited scope of tasks, suggesting avenues for future multi-task, data-filtered, and broader-domain evaluations.

Abstract

Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid- and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy -- from general bilingual bitext to task-specific data -- and adapts only a small set of DoRA weights. On the AfriMGSM benchmark, MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low- and high-resource settings.
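To make the "small set of DoRA weights" concrete, here is a minimal sketch of the standard DoRA reparameterization, W' = m · (W0 + BA) / ||W0 + BA||_c, where the norm is taken per column. It illustrates the general technique, not MERLIN's actual training code. The toy matrix, the rank, and the helper names (`column_norms`, `matmul`, `dora_weight`) are assumptions chosen for illustration.

```python
import math

def column_norms(W):
    """L2 norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def dora_weight(W0, B, A, m):
    """Recompose the adapted weight: magnitude m times the column-normalized direction."""
    delta = matmul(B, A)                                   # low-rank update B @ A
    V = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]
    norms = column_norms(V)
    return [[m[j] * row[j] / norms[j] for j in range(len(norms))] for row in V]

# Hypothetical toy layer: W0 is frozen; only m, B and A would be trained.
W0 = [[3.0, 0.0],
      [4.0, 2.0]]
B = [[0.0], [0.0]]          # d_out x r, zero-initialized so training starts at W0
A = [[0.5, -0.5]]           # r x d_in
m = column_norms(W0)        # magnitude vector initialized to the column norms of W0

W_init = dora_weight(W0, B, A, m)        # equals W0 because B is zero

B_step = [[1.0], [0.0]]                  # pretend one update has changed B
W_step = dora_weight(W0, B_step, A, m)   # direction changes, column norms stay at m
```

In this sketch the trainable state is `m`, `B`, and `A`. The frozen `W0` only enters through the direction term, which is what keeps the adapted parameter count small.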

Paper Structure

This paper contains 37 sections, 6 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Overview of Merlin (Multilingual Embedding-Enhanced Reasoning for Language Integration Network), our two-stage framework for multilingual reasoning. In the Mapping Stage, a lightweight mapping layer is trained sequentially on three datasets (general bilingual translation, then query-pair translation, and finally task-specific QA translation) to align the multilingual encoder’s outputs with the LLM’s embedding space. In the Embedding Enhancement Stage, this mapping “connector” remains frozen while the LLM body is adapted via parameter-efficient fine-tuning (PEFT) on QA data, thereby strengthening cross-lingual reasoning.
  • Figure 2: Comparison of Merlin performance across five different multilingual encoders, with Gemma 2 9B as the LLM.
  • Figure 3: t-SNE visualizations of sentence embeddings for mT5-xl, AfriTeVa-V2-Large, and NLLB200-distilled-1.3B, showing how LRL data impacts cross-lingual alignment. More LRL data reduces per-language clustering and brings low-resource languages closer to English, which enhances transfer learning.
  • Figure 4: Layer-wise cross-lingual retrieval@5. Scores are averaged over English and 16 African languages. Merlin achieves the highest mid-layer alignment, whereas MindMerger peaks only in the final decoder blocks.
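
The retrieval@5 metric named in Figure 4 can be illustrated with a short sketch. It assumes a common protocol: sentence embeddings of parallel sentences are compared by cosine similarity, and a query counts as a hit when its translation (the candidate at the same index) appears among the top k candidates. The paper's exact pooling and evaluation setup is not given here, so the function names and the toy vectors below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieval_at_k(queries, candidates, k=5):
    """Fraction of queries whose parallel candidate (same index) ranks in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        # Rank candidate indices from most to least similar to the query.
        ranked = sorted(range(len(candidates)),
                        key=lambda j: cosine(q, candidates[j]),
                        reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy embeddings from one hypothetical layer: row i of each list is the same sentence.
english = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]  # third row is poorly aligned

score_top1 = retrieval_at_k(target, english, k=1)
score_top3 = retrieval_at_k(target, english, k=3)
```

Running this per layer and averaging over language pairs would give a layer-wise curve of the kind the caption describes. The per-layer extraction step is omitted because it depends on the model.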