MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish

Xin Huang, Tarun Kumar Vangani, Minh Duc Pham, Xunlong Zou, Bin Wang, Zhengyuan Liu, Ai Ti Aw

TL;DR

The paper tackles cross-lingual understanding in multilingual LLMs by introducing MERaLiON-TextLLM, a model family built on Llama-3 that uses continued pre-training and weight merging to improve Chinese, Indonesian, Malay, and Singlish capabilities. The authors show that instruction-tuning alone does not beat strong baselines, but merging the pretrained and instructed weights yields better performance on Cross-MMLU, Cross-LogiQA, IndoMMLU, and CN-Eval, while remaining resource-efficient on TPU/GPU infrastructure. They release open-source checkpoints to support further research and use deliberate corpus distribution and replay strategies to balance learning across languages. The work outlines a practical pathway for developing resource-efficient, domain-adaptive, multilingual LLMs with improved cross-lingual reasoning and knowledge coverage, particularly for Southeast Asian languages. The authors anticipate future applications in translation, summarization, and content analytics.

Abstract

Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.
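
The abstract names weight merging but this summary does not give the merging formula. As a minimal sketch only, assuming simple linear interpolation between two checkpoints that share parameter names and shapes (the paper's actual recipe may differ), the idea can be illustrated on toy data. The function name `merge_weights`, the `alpha` parameter, and the toy parameter names are hypothetical and chosen for illustration.

```python
from typing import Dict, List

# A toy "checkpoint": parameter name -> flat list of values.
# Real checkpoints hold tensors; plain lists keep this sketch dependency-free.
StateDict = Dict[str, List[float]]


def merge_weights(base: StateDict, tuned: StateDict, alpha: float = 0.5) -> StateDict:
    """Linearly interpolate two checkpoints: (1 - alpha) * base + alpha * tuned.

    Assumption: linear interpolation is one common merging scheme, used here
    for illustration; it is not confirmed as the method used in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: StateDict = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Hypothetical toy checkpoints standing in for the continued-pretrained
# and instructed models.
continued_pretrained = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
instructed = {"layer.weight": [4.0, 2.0], "layer.bias": [3.0]}

merged = merge_weights(continued_pretrained, instructed, alpha=0.5)
print(merged)
```

With `alpha=0.0` the result equals the first checkpoint and with `alpha=1.0` it equals the second, so `alpha` controls how much of each model's behavior is retained.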

Paper Structure (6 sections, 1 figure, 5 tables)

Figures (1)

  • Figure 1: Distribution of Tokens