Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model

Yen-Ting Lin, Yun-Nung Chen

TL;DR

Taiwan-LLM addresses the underrepresentation of Traditional Chinese (zh-TW) in large language models by pretraining on Taiwan-specific text and applying culturally aligned instruction fine-tuning. The approach builds on a Llama 2 foundation and uses a three-phase pipeline: continue-pretraining on a Taiwanese corpus, supervised fine-tuning on multi-turn dialogues, and feedback-driven fine-tuning from real users. The 13B-parameter model achieves competitive performance on the TC-Eval benchmarks (average around 53.99%), approaching or matching some proprietary models despite its smaller size, and outperforms open-source baselines focused on Simplified Chinese (zh-CN). The open-source release and accompanying datasets are intended to support researchers and communities, and the results underline the role of data quality and targeted linguistic and cultural alignment in regional language technologies.

Abstract

In the realm of language models, the nuanced linguistic and cultural intricacies of Traditional Chinese, as spoken in Taiwan, have been largely overlooked. This paper introduces Taiwan LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language, with a focus on the variant used in Taiwan. Leveraging a comprehensive pretraining corpus and instruction-finetuning datasets, we have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan. Taiwan LLM represents the first of its kind, a model that is not only linguistically accurate but also culturally resonant with its user base. Our evaluations demonstrate that Taiwan LLM achieves superior performance in understanding and generating Traditional Chinese text, outperforming existing models that are predominantly trained on Simplified Chinese or English. The open-source release of Taiwan LLM invites collaboration and further innovation, ensuring that the linguistic diversity of Chinese speakers is embraced and well-served. The model, datasets, and further resources are made publicly available to foster ongoing research and development in this field.

Paper Structure

This paper contains 17 sections, 3 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: The three-phase methodology for Taiwan-LLM development: (1) cPT - Continue-Pretraining on a large-scale Taiwanese corpus with quality assurance checks, (2) SFT - Supervised Fine-Tuning on multi-turn dialogues through prompt datasets and LLM self-play, and (3) Feedback SFT - Enhancing model performance through real user interactions and subsequent refinement loops, leveraging native speaker insights for cultural and linguistic accuracy.
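
The ordering of the three phases in Figure 1 can be illustrated with a minimal sketch. It only models the sequencing, where each phase starts from the checkpoint produced by the previous one. The names `Phase`, `PIPELINE`, `run_pipeline`, and `fake_train`, as well as the checkpoint label `"llama-2"`, are hypothetical and chosen for illustration; they are not taken from the authors' code, and no real training is performed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Phase:
    name: str         # short label used in the Figure 1 caption
    data_source: str  # kind of data the phase consumes, per the caption


# The three phases in the order given in the Figure 1 caption.
PIPELINE: List[Phase] = [
    Phase("cPT", "large-scale Taiwanese corpus with quality assurance checks"),
    Phase("SFT", "multi-turn dialogues from prompt datasets and LLM self-play"),
    Phase("Feedback SFT", "real user interactions refined with native-speaker insights"),
]


def run_pipeline(base_checkpoint: str, train: Callable[[str, Phase], str]) -> str:
    """Apply each phase in order; every phase starts from the previous checkpoint."""
    checkpoint = base_checkpoint
    for phase in PIPELINE:
        checkpoint = train(checkpoint, phase)
    return checkpoint


def fake_train(checkpoint: str, phase: Phase) -> str:
    """Hypothetical stand-in for a training job: it only records the phase name."""
    return f"{checkpoint}+{phase.name}"


if __name__ == "__main__":
    # "llama-2" is a placeholder label, not a real model identifier.
    print(run_pipeline("llama-2", fake_train))
```

Passing the training step in as a callable keeps the ordering logic separate from whatever framework would actually run each phase.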