Typhoon: Thai Large Language Models

Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarnmongkol, Ruangsak Patomwong, Pathomporn Chokchainant, Kasima Tharnpipitchai

TL;DR

Typhoon is a Thai-specific, 7B-parameter LLM adapted from Mistral-7B. It addresses Thai data scarcity through a data-centric pretraining pipeline and a Thai-adapted tokenizer. The paper introduces ThaiExam to evaluate the Thai knowledge of pretrained models and reports that Typhoon outperforms other open-source Thai LLMs, performing on par with GPT-3.5 in Thai while tokenizing Thai text 2.62 times more efficiently. It also explores instruction-tuning with translated, self-instruct, and template-based data, showing strong instruction-following and zero-shot capabilities on translation, summarization, and QA tasks. Overall, Typhoon offers a practical, open, Thai-focused LLM with strong benchmark results and a path toward larger models and improved alignment.

Abstract

Typhoon is a series of Thai large language models (LLMs) developed specifically for the Thai language. This technical report presents challenges and insights in developing Thai LLMs, including data preparation, pretraining, instruction-tuning, and evaluation. As one of the challenges of low-resource languages is the amount of pretraining data, we apply continual training to transfer existing world knowledge from a strong LLM. To evaluate the Thai knowledge encapsulated in each model from the pretraining stage, we develop ThaiExam, a benchmark based on examinations for high-school students and investment professionals in Thailand. In addition, we fine-tune Typhoon to follow Thai instructions, and we evaluate instruction-tuned models on Thai instruction datasets as well as translation, summarization, and question-answering tasks. Experimental results on a suite of Thai benchmarks show that Typhoon outperforms all open-source Thai language models, and its performance is on par with GPT-3.5 in Thai while having only 7 billion parameters and being 2.62 times more efficient in tokenizing Thai text.
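
The "2.62 times more efficient" figure is a tokenization ratio. The abstract does not spell out how it is computed, so the following is only a minimal sketch under one common reading: efficiency as the number of tokens a baseline tokenizer produces divided by the number a candidate tokenizer produces on the same Thai text. The function name `token_efficiency_ratio` and both toy tokenizers are hypothetical stand-ins for illustration; they are not the Typhoon or GPT-3.5 tokenizers.

```python
from typing import Callable, Iterable, Sequence

# A tokenizer here is any function that maps a string to a sequence of tokens.
Tokenizer = Callable[[str], Sequence]


def token_efficiency_ratio(
    texts: Iterable[str], baseline: Tokenizer, candidate: Tokenizer
) -> float:
    """Return baseline token count / candidate token count over the same texts.

    A value above 1.0 means the candidate needs fewer tokens than the baseline.
    This definition is an assumption, not taken from the paper.
    """
    texts = list(texts)
    baseline_total = sum(len(baseline(t)) for t in texts)
    candidate_total = sum(len(candidate(t)) for t in texts)
    if candidate_total == 0:
        raise ValueError("candidate tokenizer produced no tokens")
    return baseline_total / candidate_total


def byte_level_tokens(text: str) -> Sequence:
    # Toy baseline: one token per UTF-8 byte (Thai characters take 3 bytes each).
    return list(text.encode("utf-8"))


def char_level_tokens(text: str) -> Sequence:
    # Toy candidate: one token per Unicode code point.
    return list(text)


if __name__ == "__main__":
    thai_samples = ["สวัสดี", "ภาษาไทย"]
    ratio = token_efficiency_ratio(thai_samples, byte_level_tokens, char_level_tokens)
    print(f"efficiency ratio: {ratio:.2f}")
```

In practice the two callables would wrap real tokenizers and the texts would be a held-out Thai corpus; a higher ratio means fewer tokens per document, which lowers inference cost and fits more Thai text into a fixed context window.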

Paper Structure (13 sections, 1 equation, 1 figure, 5 tables)

Figures (1)

  • Figure 1: Performance of Typhoon and other open-source Thai large language models on Thai Examinations (left) and Thai instruction-following (right). Details about ThaiExam and the Thai instruction-following evaluation are provided in the sections on pretrained-model evaluation and instruction-tuned model evaluation, respectively.