SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, Lidong Bing

TL;DR

SeaLLMs 3 introduces an efficient language-enhancement strategy based on Language-Specific Neurons (LSNs) to expand Southeast Asian (SEA) language coverage while preserving high-resource language capabilities. Built on the Qwen2 foundation, it uses targeted SEA LSN training and a carefully constructed instruction-tuning dataset to achieve state-of-the-art performance among models of similar size across multilingual world knowledge, math, instruction-following, and translation tasks. The work addresses safety and reliability, evaluated on the SeaRefuse and MultiJail benchmarks, with the aim of reducing hallucinations and enabling culturally appropriate responses. By open-sourcing both foundation and chat models, SeaLLMs 3 aims to accelerate inclusive AI development for diverse SEA languages and communities.

Abstract

Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.

Paper Structure

This paper contains 26 sections, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Language-Specific Neuron Training.
  • Figure 2: Language distribution of the SFT data.