Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model
Yen-Ting Lin, Yun-Nung Chen
TL;DR
Taiwan LLM addresses the underrepresentation of Traditional Chinese (zh-TW) in large language models by pretraining on Taiwan-specific text and applying culturally aligned instruction fine-tuning. Built on a Llama 2 foundation, the approach uses a three-phase pipeline: continued pretraining on a Taiwanese corpus, supervised fine-tuning on multi-turn dialogues, and fine-tuning driven by feedback from real users. The 13B-parameter model achieves competitive performance on the TC-Eval benchmarks (average around 53.99%), approaching or matching some proprietary models despite its smaller size, and outperforms open-source baselines focused on Simplified Chinese (zh-CN). The open-source release and accompanying datasets are intended to support researchers and communities, and the results underscore the importance of data quality and targeted linguistic and cultural alignment for regional language technologies.
Abstract
In the realm of language models, the nuanced linguistic and cultural intricacies of Traditional Chinese, as spoken in Taiwan, have been largely overlooked. This paper introduces Taiwan LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language, with a focus on the variant used in Taiwan. Leveraging a comprehensive pretraining corpus and instruction-finetuning datasets, we have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan. Taiwan LLM represents the first of its kind, a model that is not only linguistically accurate but also culturally resonant with its user base. Our evaluations demonstrate that Taiwan LLM achieves superior performance in understanding and generating Traditional Chinese text, outperforming existing models that are predominantly trained on Simplified Chinese or English. The open-source release of Taiwan LLM invites collaboration and further innovation, ensuring that the linguistic diversity of Chinese speakers is embraced and well-served. The model, datasets, and further resources are made publicly available to foster ongoing research and development in this field.
