Sailor: Open Language Models for South-East Asia

Longxu Dou; Qian Liu; Guangtao Zeng; Jia Guo; Jiahui Zhou; Wei Lu; Min Lin

TL;DR

Sailor tackles the challenge of building open large language models that perform well across South-East Asian (SEA) languages by continually pre-training Qwen1.5 on the SEA-focused SailCraft corpus of roughly 200B tokens together with a replay corpus. The approach combines data-centric techniques (merging adjacent short examples, document-level code-switching, aggressive cleaning and deduplication), a tokenization strategy (BPE dropout), and data-mixture optimization via proxy models and a linear-regression-based search. It mitigates the curse of multilinguality by tuning the learning rate and data proportions jointly, summarized by the "magic metric" $\log(\text{Source Proportion}) - \log(\text{Learning Rate})$. The resulting Sailor models, from 0.5B to 7B parameters, outperform baselines on SEA benchmarks covering question answering, commonsense reasoning, reading comprehension, and examinations, and are released openly to spur further multilingual SEA research and development. The work demonstrates practical strategies for improving multilingual performance in low- to mid-resource languages and highlights the importance of robust data curation, code-switching, and data-mixture simulation for future SEA-focused LLMs.
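As a rough illustration of the metric named above, the sketch below computes $\log(\text{Source Proportion}) - \log(\text{Learning Rate})$ for a grid of configurations and fits a quadratic to validation loss as a function of it, which is the functional form the Figure 4 caption describes for English. The function names, the grid, and the loss values are all assumptions made for illustration: the losses are generated from an arbitrary known quadratic, not taken from the paper.

```python
import math
import numpy as np


def magic_metric(source_proportion: float, learning_rate: float) -> float:
    """Return log(source proportion) - log(learning rate), using natural logs."""
    if not 0.0 < source_proportion <= 1.0:
        raise ValueError("source_proportion must be in (0, 1]")
    if learning_rate <= 0.0:
        raise ValueError("learning_rate must be positive")
    return math.log(source_proportion) - math.log(learning_rate)


def fit_quadratic(metrics, losses):
    """Least-squares fit of loss ~ a*m**2 + b*m + c; returns (a, b, c)."""
    a, b, c = np.polyfit(np.asarray(metrics, dtype=float),
                         np.asarray(losses, dtype=float), deg=2)
    return float(a), float(b), float(c)


# Hypothetical grid of (proportion, learning rate) settings.
proportions = [0.1, 0.2, 0.4, 0.6, 0.8]
learning_rates = [1e-4, 2e-4, 4e-4]
metrics = [magic_metric(p, lr) for p in proportions for lr in learning_rates]

# SYNTHETIC losses from an arbitrary quadratic, used only to show the mechanics.
true_a, true_b, true_c = 0.05, -0.7, 5.0
losses = [true_a * m * m + true_b * m + true_c for m in metrics]

a, b, c = fit_quadratic(metrics, losses)
best_metric = -b / (2.0 * a)  # vertex of the fitted parabola (a > 0)
print(f"fitted a={a:.3f}, b={b:.3f}, c={c:.3f}, loss minimum at metric={best_metric:.2f}")
```

One property worth noting is that the metric depends only on the ratio of proportion to learning rate, so doubling both leaves it unchanged; under the quadratic model, such configurations would be predicted to reach the same loss.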

Abstract

We present Sailor, a family of open language models ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. These models are continually pre-trained from Qwen1.5, a great language model for multilingual use cases. From Qwen1.5, Sailor models accept 200B to 400B tokens, primarily covering the languages of English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout for improving the model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize data mixture. Experimental results on four typical tasks indicate that Sailor models demonstrate strong performance across different benchmarks, including commonsense reasoning, question answering, reading comprehension and examination. Embracing the open-source spirit, we share our insights through this report to spark a wider interest in developing large language models for multilingual use cases.
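To make the BPE dropout idea concrete, here is a minimal sketch of the general technique on a toy merge table: at each step, every applicable merge is skipped with probability `dropout`, so the same word can be segmented in several ways during training. The merge table, function name, and example word are hypothetical and unrelated to the Qwen1.5 tokenizer; the rate of 0.1 matches the value quoted in the Figure 2 caption.

```python
import random


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with ranked BPE merges, dropping each candidate merge
    with probability `dropout` at every step (dropout=0.0 is plain greedy BPE)."""
    rng = rng or random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect applicable merges that survive dropout at this step.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and not (dropout > 0.0 and rng.random() < dropout)
        ]
        if not candidates:
            break  # nothing left to merge, or every candidate was dropped
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Toy merge table (hypothetical), ordered from highest to lowest priority.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

print(bpe_encode("lower", MERGES))               # deterministic segmentation
print(bpe_encode("lower", MERGES, dropout=1.0))  # every merge dropped
rng = random.Random(0)
for _ in range(3):
    # Stochastic segmentations at the rate quoted in the Figure 2 caption.
    print(bpe_encode("lower", MERGES, dropout=0.1, rng=rng))
```

Whatever segmentation is sampled, concatenating the pieces always reproduces the input word, so the model sees the same text under varying subword boundaries.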

Paper Structure (53 sections, 9 figures, 8 tables, 1 algorithm)

This paper contains 53 sections, 9 figures, 8 tables, 1 algorithm.

Introduction
Insights
Data
Merging Adjacent Short Examples
Code-Switching
Aggressive Data Cleaning and Deduplication
Tokenization
BPE Dropout
Vocabulary Expansion
Training
The Curse of Multilinguality
Learning Rate Tuning
Data Mixture Simulation
Best Practise for Continual Pre-training
Data Sources
...and 38 more sections

Figures (9)

Figure 1: The pipeline of building Sailor, with insights marked by stars.
Figure 2: Initially, Sailor models were trained on 200B tokens using a greedy tokenization strategy. Subsequently, they were fine-tuned using BPE dropout for an additional 2B tokens, with a dropout rate of 0.1. As observed, BPE dropout improves robustness.
Figure 3: We initially pre-train a 120M model using a corpus of 20B tokens focusing on English. Subsequently, we continually pre-train the model using a mixed corpus comprising both English and SEA languages. Each data point here corresponds to a different configuration of data mixture and learning rate. As indicated, under a fixed total token budget, there is a trade-off between the model's performance on English and SEA languages.
Figure 4: Under the same token budget, we observe that (a) the validation loss on English can be modeled as a quadratic function of $\log(\textrm{English Proportion})-\log(\textrm{Learning Rate})$; (b) the validation loss on SEA languages, using Malay as an example, can be approximately represented by a quadratic function with $\log(\textrm{Malay Proportion})+\log(\textrm{Learning Rate})$; (c) we can tune the learning rate by analyzing the learning curves on SEA languages.
Figure 5: We employ the experimental results from proxy models across a variety of data mixtures (e.g., 64 distinct data mixtures here) to fit a linear regression model. The model is then used to predict the validation loss of numerous simulated random data mixtures, enabling us to identify the most effective data mixture for optimizing the joint loss. Subsequently, the best data mixture is applied to large-scale training.
...and 4 more figures
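The data-mixture simulation described in the Figure 5 caption can be sketched roughly as follows: fit a linear regression on proxy-model results, predict the loss of many randomly sampled mixtures, and keep the one with the lowest predicted joint loss. Several choices here are assumptions rather than details from the paper: the features are log-proportions plus an intercept, the joint loss is the sum of per-language losses, the three languages are arbitrary, and the 64 "proxy runs" are synthetic values generated from a known formula.

```python
import numpy as np


def features(mixtures):
    """Map each mixture (row of proportions) to [1, log p_1, ..., log p_n]."""
    mixtures = np.asarray(mixtures, dtype=float)
    return np.hstack([np.ones((len(mixtures), 1)), np.log(mixtures)])


def fit_linear(mixtures, losses):
    """Least-squares fit; returns weights of shape (n_features, n_languages)."""
    weights, *_ = np.linalg.lstsq(features(mixtures),
                                  np.asarray(losses, dtype=float), rcond=None)
    return weights


def sample_mixtures(n_languages, n_samples, rng):
    """Draw random mixtures on the simplex, kept away from zero proportions."""
    mixtures = np.clip(rng.dirichlet(np.ones(n_languages), size=n_samples), 1e-6, None)
    return mixtures / mixtures.sum(axis=1, keepdims=True)


def search_best_mixture(weights, n_languages, n_samples, rng):
    """Predict per-language losses for random mixtures; return the best one."""
    candidates = sample_mixtures(n_languages, n_samples, rng)
    joint = (features(candidates) @ weights).sum(axis=1)  # assumed joint loss
    best = int(np.argmin(joint))
    return candidates[best], float(joint[best])


rng = np.random.default_rng(0)
languages = ["en", "th", "vi"]  # hypothetical subset

# SYNTHETIC stand-in for 64 proxy-model runs: loss_i = c_i - k_i * log(p_i).
true_k = np.array([0.2, 0.5, 0.3])
true_c = np.array([2.0, 3.0, 2.5])
proxy_mixtures = sample_mixtures(len(languages), 64, rng)
proxy_losses = true_c - true_k * np.log(proxy_mixtures)

weights = fit_linear(proxy_mixtures, proxy_losses)
best_mixture, best_loss = search_best_mixture(weights, len(languages), 20000, rng)
print(dict(zip(languages, best_mixture.round(3))), round(best_loss, 3))
```

With this synthetic loss, the true optimum is a mixture proportional to `true_k`, so the search result can be checked against a known answer; with real proxy runs there is no such ground truth, and the prediction is only as good as the regression fit.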
